41 PhamVanSon CT1201
HAI PHONG PRIVATE UNIVERSITY
-------------o0o--------------
HAI PHONG, 12/2012
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
INTRODUCTION
2.4 HISTORY OF SENTIMENT ANALYSIS AND OPINION MINING
CONCLUSION
ACKNOWLEDGEMENTS
Hai Phong, day .. month .. year ....
Student
Phạm Văn Sơn
INTRODUCTION
In the present era, the explosive growth of information technology (IT) has driven the development of many other fields. IT is arguably reshaping the world economy and helping humanity take its first firm steps on the road to the knowledge economy and e-commerce. Today, people no longer have to toil at collecting data, because they have a capable assistant in computer systems and data transmission networks deployed on a global scale.
However, the rapid progress of IT has considerably increased the volume of information exchanged over the Internet, especially e-mail and online news. According to statistics from Broer et al. (2008), the amount of information doubles roughly every 6 to 10 months, and the rate at which information changes is also extremely fast. Activity in every field therefore requires processing a massive volume of information. A major requirement facing us is how to organize and search information as effectively as possible, and classifying information is one reasonable answer to this requirement. With such a large volume of information and the need to process it quickly, however, manual classification is unrealistic. The way forward is to build solutions that can be expressed as algorithms and programs, so that a computer can classify this information automatically.
Purpose, subject, and scope of the research
Within the scope of this thesis, we study the sentiment classification problem, the theoretical basis of the SVM method, and related issues. We analyze solutions that extend and improve SVM to make it more effective in applications. Introducing fuzzy techniques into SVM allows the data space to be partitioned better, with the aim of eliminating the regions that a conventional SVM leaves unclassified.
We present ways of applying the SVM technique, together with its improvements and extensions, to a number of practical application problems.
We give an overview of the sentiment classification problem, and specifically of polarity classification, which divides opinionated documents into positive and negative ones.
We examine opinion data and write an experimental program that performs polarity classification of documents using SVM.
CHAPTER 1: OVERVIEW OF SUPPORT VECTOR MACHINES
Figure 2.1: Separation of a sample set by the hyperplane (w, b) in two-dimensional space
1.1.1 Brief overview of data classification
- Data classification is one of the most widely used data mining techniques and is the subject of extensive ongoing research.
- Purpose: to predict class labels for new data sets or samples.
Input: a set of training samples, with a class label for each sample
Output: a classifier built from the training set, or the class labels
Data classification builds on the training set and the values of a class-label attribute, and uses them to determine the class of new data.
The data classification technique proceeds in two steps:
Step 1: Build a model from the training set
Step 2: Use the model: check the correctness of the model and use it to classify new data.
Step 1. Build the model
Step 2: Use the model
- Classify new or previously unclassified objects
- Evaluate the accuracy of the model
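The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classifier is a simple nearest-centroid rule standing in for any model, and the names `build_model`, `classify`, `accuracy`, and the toy data are hypothetical, not taken from the thesis.

```python
from math import dist

def build_model(samples, labels):
    """Step 1: build a model (one centroid per class) from the training set."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(model, x):
    """Step 2a: assign a new sample to the class with the nearest centroid."""
    return min(model, key=lambda y: dist(model[y], x))

def accuracy(model, samples, labels):
    """Step 2b: fraction of held-out samples whose predicted label is correct."""
    correct = sum(classify(model, x) == y for x, y in zip(samples, labels))
    return correct / len(labels)

# Toy data (assumed): two features per sample, two class labels.
train_x = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
train_y = ["neg", "neg", "pos", "pos"]
test_x = [(1.2, 1.4), (5.5, 5.0)]
test_y = ["neg", "pos"]

model = build_model(train_x, train_y)
test_accuracy = accuracy(model, test_x, test_y)
```

Keeping the test samples separate from the training samples is what makes the accuracy in Step 2 a meaningful check of the model.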
1.2 THE SVM ALGORITHM
1.2.2 Definition
Given a training set represented in a vector space, in which each document is a point, the method finds the best decision hyperplane that divides the points in this space into two separate classes, class + and class -. The quality of this hyperplane is determined by the distance (called the margin) from the nearest data point of each class to the hyperplane. The larger the margin, the better the decision hyperplane and the more accurate the classification.
The goal of the SVM method is to find the largest margin, as illustrated below:
Figure 2.5: The hyperplane dividing the training data into the two classes + and - with the largest margin. The nearest points (circled) are the support vectors.
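To make the notion of margin concrete, the following minimal sketch computes the distance from the nearest point to a hyperplane w·x + b = 0 and compares two candidate hyperplanes. The function name `margin`, the sample points, and both hyperplanes are assumptions chosen for illustration; no training is performed here.

```python
from math import sqrt

def margin(w, b, points):
    """Smallest distance from any point to the hyperplane w.x + b = 0."""
    norm = sqrt(sum(wj * wj for wj in w))
    return min(abs(sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
               for x in points)

# Toy points (assumed): the first two form one class, the last two the other.
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0)]

# Two hyperplanes that both separate the classes.
w1, b1 = (1.0, 1.0), -5.0   # x1 + x2 - 5 = 0, midway between the classes
w2, b2 = (1.0, 1.0), -3.0   # x1 + x2 - 3 = 0, close to the first class

margin_1 = margin(w1, b1, points)
margin_2 = margin(w2, b2, points)
```

Both hyperplanes classify the toy points correctly, but the first has the larger margin, so by the criterion above it is the better decision hyperplane.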
1.2.4.1 Theoretical basis
$$x_i \cdot w + b = 0$$
Let
$$f(x_i) = \operatorname{sign}(x_i \cdot w + b) = \begin{cases} +1, & x_i \cdot w + b > 0 \\ -1, & x_i \cdot w + b < 0 \end{cases}$$
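The decision function f can be written directly from its definition. The sketch below assumes that w and b are already known; the name `decision` and the example values are hypothetical. Because f is defined above only for strictly positive or strictly negative values, a point lying exactly on the hyperplane is rejected rather than given a label.

```python
def decision(w, b, x):
    """f(x) = sign(x.w + b): +1 on the positive side, -1 on the negative side."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    if score == 0:
        # f is not defined for points lying on the hyperplane itself
        raise ValueError("x lies exactly on the hyperplane")
    return 1 if score > 0 else -1

# Assumed hyperplane x1 + x2 - 5 = 0
w, b = (1.0, 1.0), -5.0
label_a = decision(w, b, (4.0, 4.0))   # positive side
label_b = decision(w, b, (1.0, 1.0))   # negative side
```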
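Thus $f(x_i)$ expresses the assignment of $x_i$ to one of the two classes described above. We say $y_i = +1$ if $x_i$ belongs to class I and $y_i = -1$ if $x_i$ belongs to class II. To obtain the hyperplane $f$, we then have to solve the following problem:
Find $\min \|w\|$ with $w$ satisfying the following condition: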
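For reference, the standard hard-margin SVM formulation, which is consistent with the definitions of $f$ and $y_i$ above, is usually written as follows. It is given here as an assumption about the intended condition, with $n$ denoting the number of training samples:

$$\min_{w,\,b} \|w\| \quad \text{subject to} \quad y_i\,(x_i \cdot w + b) \ge 1, \qquad i = 1, \dots, n.$$

Under this constraint the nearest points of each class satisfy $y_i (x_i \cdot w + b) = 1$, so their distance to the hyperplane is $1 / \|w\|$. Minimizing $\|w\|$ is therefore the same as maximizing the margin.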
Figure 2.6: Illustration of the two-class problem solved with the SVM method
To separate multiple classes, the original SVM technique divides the data space into two parts and repeats this process many times. The decision function that assigns data to the i-th class of the set of n two-class problems is then:
$$f_i(x) = w_i^T x + b_i$$
Those elements x that are support vectors satisfy the condition
$$f_i(x) = \begin{cases} +1 & \text{if } x \text{ belongs to class } i \\ -1 & \text{if } x \text{ belongs to the rest} \end{cases}$$
Thus the multi-class problem can be handled with the SVM method in exactly the same way as the two-class problem, by using the "one-against-one" strategy.
Suppose the problem has k classes to distinguish (k > 2). The "one-against-one" strategy carries out k(k-1)/2 binary classifications using the SVM method. Each class is separated from each of the k-1 remaining classes, giving k-1 separating functions, each obtained from the two-class SVM problem.
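The one-against-one strategy can be sketched as a vote over all class pairs. In the sketch below the pairwise classifiers are simple stand-ins (closest class centre in one dimension) rather than trained SVMs, and combining them by majority vote is an assumption; the names `one_against_one`, `make_pairwise`, and `centres` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def one_against_one(classes, binary_classifiers, x):
    """Predict a class for x by majority vote over all pairwise classifiers.

    binary_classifiers maps a pair (a, b) to a function that returns +1 when
    x looks like class a and -1 when it looks like class b.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if binary_classifiers[(a, b)](x) > 0 else b
        votes[winner] += 1
    return votes.most_common(1)[0][0]

# Hypothetical 1-D example with k = 3 classes centred at 0, 5 and 10.
centres = {"low": 0.0, "mid": 5.0, "high": 10.0}
classes = list(centres)

def make_pairwise(a, b):
    # Stand-in for a trained binary SVM: +1 if x is closer to class a's centre
    return lambda x: 1 if abs(x - centres[a]) < abs(x - centres[b]) else -1

classifiers = {(a, b): make_pairwise(a, b) for a, b in combinations(classes, 2)}
k = len(classes)
prediction = one_against_one(classes, classifiers, 4.0)
```

With k = 3 the dictionary holds k(k-1)/2 = 3 pairwise classifiers, and each class takes part in k-1 = 2 of them, matching the counts given above.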
1.2.4.4 Main steps of the SVM method
CHAPTER 2: THE SENTIMENT CLASSIFICATION PROBLEM
[...] And conversely, [...] the Internet.
According to two surveys of more than 2,000 American adults each: 81% of Internet users (or 60% of Americans) have researched a product online at least once, and 20% (15% of all Americans) do so on a typical day. Among readers of online reviews of restaurants, hotels, and various services (for example, travel agencies or doctors), between 73% and 87% report that the reviews had a significant influence on their purchase. Consumers are willing to pay from 20% to 99% more for a 5-star-rated item than for a 4-star-rated item. 32% have provided a rating of a product or service through an online rating system, and respondents, including 18% of online senior citizens, have posted an online comment or review of a product or service.
[...] goods and services is not the only motivation for [...] opinions online. For example, in a survey of more than 2,500 American adults, Rainie and Horrigan studied the 31% of Americans (over 60 million people) [...] 2006 [...], that is, those who gathered information about the 2006 elections online and exchanged [...] via email.
Among them:
[...] the interest that individual users take in online opinions about products and services, [...].
With the explosion of Web 2.0 platforms [...], discussion forums, peer-to-peer networks, and various other types of [...], [...] unprecedented [...] and power to share their own experiences and opinions, whether positive or negative. As major companies are increasingly coming to realize, these consumer voices can wield enormous influence in shaping the opinions of [...] consumers [...], their brand loyalty, their purchase decisions, and their own brand advocacy [...]. Companies can respond to the [...] consumer [...] they generate through social media [...] and analysis [...].
However, industry analysts note that leveraging new media for the purpose of [...] product image requires new technology.
Marketers have always needed to monitor the media for information related to their brands, whether for public relations activities, fraud violations, or competitive intelligence. But the fragmentation of the media [...] changing consumer behavior [...] traditional [...]. Technorati estimates that 75,000 new blogs are created every day, along with 1.2 million new posts each day, [...] consumer opinions discussing products and services.
Therefore, [...] individuals [...] systems capable of automatically analyzing consumer [...].
Developing a complete [...] may involve tackling the following problems.
Sentiment analysis [...] (opinion mining) has recently attracted [...] interest [...] awareness of the research problems and [...].
[...]:
- The rise of machine learning methods, natural language processing, and information retrieval.
- The availability of [...] the Internet, specifically the development of [...].
- Realization of the intellectual challenges and the commercial and [...] applications [...].
2.5.1 Identifying opinion words and phrases
2.5.3 Using verbs
2.5.4 Determining the orientation of opinion words and phrases
In sentiment analysis, the orientation of words and phrases directly expresses the opinion and emotion of the writer. The main approaches to recognizing the sentiment orientation of opinion-bearing words and phrases are statistics-based or lexicon-based [...].
There are two [...]: Sentiment Classification [...] and Sentiment Extraction [...].
[...]: comprises three main tasks:
- [...]
- [...] (positive, negative)
- [...]
Classifying opinionated sentences/documents
[...]
Example: This laptop is great. => [...]
Example: The stock price rose [...]
Rating inference (ordinal regression): [...] 5 stars. [...] neutral [...].
2.7.3 Building a classification model to classify documents
In sentiment analysis, the orientation of words and phrases directly expresses the opinion and emotion of the writer. The main approaches to recognizing the sentiment orientation of opinion-bearing words and phrases are statistics-based or lexicon-based. For the task of classifying documents, many statistical machine learning methods have been used, such as Naive Bayes, Maximum Entropy classification, supervised learning with SVM, and decision trees.
CHAPTER 3: EXPERIMENTAL PROGRAM
The Ngram program counts how often each N-gram occurs. The counts are either written to a file or used to build a language model. They are recorded in the following format:
Where:
Some notable features of Java
- Java virtual machine
- Interpreted
- Platform independent
- Object oriented
- Multitasking, multithreaded
[...]:
train_file: [...]
- Name of train_file [...]
model_file: [...]
Implementation steps
Step 1: Use the N-gram tool to generate data files containing the N-grams of the opinionated documents. Here we use unigrams (1-grams) and bigrams (2-grams).
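As an illustration of what Step 1 produces, the sketch below counts unigrams and bigrams for a single sentence. It is not the N-gram tool used in the thesis: the function name `ngram_counts` and the whitespace tokenization are assumptions made to keep the example short.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count how often each n-gram (tuple of n consecutive tokens) occurs."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Assumed tokenization: lower-case text split on whitespace.
tokens = "this laptop is great and this laptop is fast".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
```

Here `unigrams[("this",)]` and `bigrams[("this", "laptop")]` both hold the frequency of the corresponding N-gram, which is the kind of statistic that can then be turned into features for a classifier.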
[...]
To classify the opinion documents, we split the data set into two subsets, a training set (train) and a test set (test).
The training set consists of 550 positive reviews and 550 negative reviews.
The test set consists of 150 positive reviews and 150 negative reviews.
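A split with these sizes can be sketched as follows. How the reviews were actually assigned to each subset is not stated, so the seeded random shuffle, the function name `split_per_class`, the placeholder review identifiers, and the +1/-1 label encoding are all assumptions.

```python
import random

def split_per_class(reviews, n_train, n_test, seed=0):
    """Shuffle one class's reviews and cut them into train and test parts."""
    if len(reviews) < n_train + n_test:
        raise ValueError("not enough reviews for the requested split")
    shuffled = reviews[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

# Placeholder data: 700 positive and 700 negative review identifiers.
positive = [f"pos_{i}" for i in range(700)]
negative = [f"neg_{i}" for i in range(700)]

pos_train, pos_test = split_per_class(positive, 550, 150)
neg_train, neg_test = split_per_class(negative, 550, 150)

# Label positive reviews +1 and negative reviews -1 (assumed encoding).
train = [(r, +1) for r in pos_train] + [(r, -1) for r in neg_train]
test = [(r, +1) for r in pos_test] + [(r, -1) for r in neg_test]
```

Splitting each class separately keeps both subsets balanced, with equal numbers of positive and negative reviews as in the figures above.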
Figure 3.1: Main interface of the program
Figure 3.3: Data displayed when running Get Pos Data
Figure 3.5: Screenshot of the SVM command running in the DOS environment
CONCLUSION
REFERENCES
4. http://en.wikipedia.org/wiki/Support_vector_machine
5. http://www.cs.cornell.edu
6. http://svmlight.joachims.org/
7. ftp://ftp.cs.cornell.edu/pub/smart/english.stop
8. http://www.speech.sri.com/projects/srilm/download.html