Text Mining
2009
I.
. ,
.
, /
( , TC- text categorization). T :
,
.
60-
(Maron 1960).
.
1990-
.
- ()
,
e-mail, web
, ,
.
,
. knowledge
engineering
.
e machine learning(ML)
.
,
.
. -
. ,
ML ,
.
.
.
. ,
,
.
II.
,
Web .
,
.
a.
. (),
,
.
(queries) .
a ,
NASA aerospace MeSH
.
.
,
.
, , k
. ,
,
.
.
,
, ...
.
b.
. ,
, , ... e-mail
,
, , .
.
.
, online
:
.
.
.
:
-
, .
(news feeds)
feed- .
:
-
recall errors-
precision errors-
.
recall
( : e-mail
Spam Spam mail Spam).
.
,
.
c.
Yahoo.
.
.
.
,
.
.
. (links)
,
.
.
.
III.
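The task is to approximate an unknown category assignment function $F : D \times C \rightarrow \{0, 1\}$, where $D$ is the set of all possible documents and $C$ is the set of predefined categories. $F(d, c)$ is 1 if document $d$ belongs to category $c$ and 0 otherwise.
The approximating function $M : D \times C \rightarrow \{0, 1\}$ is called a classifier, and the goal is to build a classifier whose results are as close as possible to $F$.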
a.
F,
.
, .
.
.
. |
C| (|C| e ),
.
.
b. -
:
,
. - . ,
.
- .
. , online
.
,
, .
.
c.
- .
.
. , .
.
ranking .
[0, 1],
- .
(CSV-categorization
status value).
,
CSV .
.
,
.
. k
.
.
( ).
.
(.
).
IV.
. ,
,
.
(feature vectors).
,
. , .
.
bag-of-words
,
.
. ,
1- () , 0 .
, ,
. TF-IDF w
d
. , ,
90 99 .
,
().
DocFreq (w) .
10
. . IR(Information Retrieval),
--
. , ,
,
10 - .
. information gain
$$IG(w) = \sum_{c \in C \cup \bar{C}} \; \sum_{f \in \{w, \bar{w}\}} P(f, c) \log \frac{P(c \mid f)}{P(c)}$$
.
. .
$$\chi^2_{\max}(f) = \max_{c \in C} \frac{|Tr| \cdot \big(P(f, c)\,P(\bar{f}, \bar{c}) - P(f, \bar{c})\,P(\bar{f}, c)\big)^2}{P(f)\,P(\bar{f})\,P(c)\,P(\bar{c})}$$
. (
)
100 ,
(Yang & Pedersen, 1997)
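The two feature-scoring measures above can be estimated from simple document counts. The following is a minimal sketch, assuming a single category c and its complement. The function name `feature_scores` and its count-based parameters are illustrative and do not come from the original text. For several categories, the information gain terms would be summed over categories and the chi-square value maximized over them.

```python
import math

def feature_scores(n_fc, n_f, n_c, n):
    """Score one feature f against one category c from document counts.

    n_fc: training documents that contain f and belong to c
    n_f:  training documents that contain f
    n_c:  training documents that belong to c
    n:    total number of training documents, |Tr|
    Returns (information_gain, chi_square) for this (f, c) pair.
    """
    # Joint probabilities of the four (feature present?, in category?) cells.
    p = {
        (1, 1): n_fc / n,
        (1, 0): (n_f - n_fc) / n,
        (0, 1): (n_c - n_fc) / n,
        (0, 0): (n - n_f - n_c + n_fc) / n,
    }
    p_f = {1: n_f / n, 0: 1 - n_f / n}
    p_c = {1: n_c / n, 0: 1 - n_c / n}

    ig = 0.0
    for (f, c), p_joint in p.items():
        if p_joint > 0:
            # P(c | f) / P(c) equals P(f, c) / (P(f) P(c)).
            ig += p_joint * math.log(p_joint / (p_f[f] * p_c[c]))

    numer = (p[(1, 1)] * p[(0, 0)] - p[(1, 0)] * p[(0, 1)]) ** 2
    denom = p_f[1] * p_f[0] * p_c[1] * p_c[0]
    chi2 = n * numer / denom if denom > 0 else 0.0
    return ig, chi2
```

A feature that is statistically independent of the category scores zero on both measures, while a feature that occurs exactly in the documents of the category gets the highest scores.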
b.
,
.
. ()
, , .
.
, .
.
.
(LSI).
.
V.
.
.
.
CONSTRUE
(Reuters). CONSTRUE :
If (DNF formula) then category else ¬category

    If ((wheat & farm) or
        (wheat & commodity) or
        (bushels & export) or
        (wheat & tonnes) or
        (wheat & winter & ¬soft))
    then Wheat
    else ¬Wheat
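A rule of this kind is straightforward to evaluate mechanically. The sketch below is only an illustration and not the CONSTRUE implementation. It assumes whitespace tokenization, stores each conjunction as a pair of required and forbidden words, and reads the last clause as "wheat and winter and not soft", which is the commonly cited form of this rule. The names `WHEAT_RULE` and `matches_dnf` are hypothetical.

```python
# Each clause is (words that must occur, words that must not occur).
WHEAT_RULE = [
    ({"wheat", "farm"}, set()),
    ({"wheat", "commodity"}, set()),
    ({"bushels", "export"}, set()),
    ({"wheat", "tonnes"}, set()),
    ({"wheat", "winter"}, {"soft"}),  # assumed: wheat & winter & not soft
]

def matches_dnf(text, clauses):
    """Return True if any conjunction of the DNF rule is satisfied by the text."""
    words = set(text.lower().split())
    return any(
        required <= words and not (forbidden & words)
        for required, forbidden in clauses
    )
```

For example, `matches_dnf("Winter wheat prices rose", WHEAT_RULE)` assigns the document to the Wheat category, while a document that mentions none of the listed word combinations is rejected.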
90%
Reuters(723 ).
,
.
,
( -
),
.
VI.
,
. ML ,
.
;
,
.
.
. ,
. ,
30 . ,
. ,
.
.
a.
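The classifier takes $CSV(d, c)$ to be the probability $P(c \mid d)$ that document $d$ belongs to category $c$, and computes it with Bayes' theorem:
$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}.$$
The marginal probability $P(d)$ is the same for every category, so it does not need to be computed. Estimating $P(d \mid c)$, however, requires an assumption about the structure of document $d$. With the document represented as a feature vector $d = (w_1, w_2, \dots)$, the most common assumption is that the coordinates are independent given the category, so that
$$P(d \mid c) = \prod_i P(w_i \mid c).$$
Classifiers built on this assumption are called naive Bayes (NB) classifiers.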
,
.
Domingos and Pazzani (1997).
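The product formula for P(d | c) turns into a short program once the probabilities are estimated from word counts. The following is a minimal sketch under stated assumptions, not the method of any particular paper: documents are lists of words, P(w | c) is estimated from word frequencies with add-one (Laplace) smoothing, and logarithms are summed instead of multiplying probabilities. The smoothing and the names `train_nb` and `classify_nb` are choices made for this illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, category) pairs."""
    class_counts = Counter()            # number of documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocab = set()
    for words, c in docs:
        class_counts[c] += 1
        word_counts[c].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify_nb(model, words):
    """Return the category c that maximizes log P(c) + sum_i log P(w_i | c)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        total = sum(word_counts[c].values())
        score = math.log(n_c / n_docs)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen (w, c) pairs from zeroing the product.
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```

P(d) is never computed, in line with the observation that it is constant across categories.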
b.
P (c | d ) .
(BLR)
,
,
.
,
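$$P(c \mid d) = \varphi(\beta \cdot d) = \varphi\Big(\sum_i \beta_i d_i\Big),$$
where $c = \pm 1$ is the category membership value, $\beta = (\beta_1, \beta_2, \dots)$ is the vector of model parameters, and $\varphi$ is the logistic link function
$$\varphi(x) = \frac{\exp(x)}{1 + \exp(x)} = \frac{1}{1 + \exp(-x)}.$$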
.
, .
()
:
$$p(\beta_i \mid \tau) = N(0, \tau) = \frac{1}{\sqrt{2\pi\tau}} \exp\Big(-\frac{\beta_i^2}{2\tau}\Big).$$
overfitting
.
. ,
.
TODO( ).
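The logistic model with a Gaussian prior on the parameters can be fitted by maximizing the log-posterior. The sketch below is one simple way to do that, assuming plain batch gradient ascent, labels c in {+1, -1}, and a likelihood of the form phi(c * beta . d) for each training document. The names `train_blr`, `tau`, `lr`, and `epochs` are illustrative. The Gaussian prior contributes the term -beta_i / tau to the gradient, which pulls the parameters toward zero and limits overfitting.

```python
import math

def sigmoid(x):
    """Logistic link function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def train_blr(docs, labels, tau=1.0, lr=0.1, epochs=200):
    """docs: list of feature vectors, labels: list of +1 / -1 values."""
    beta = [0.0] * len(docs[0])
    for _ in range(epochs):
        # Gradient of the log Gaussian prior with variance tau.
        grad = [-b / tau for b in beta]
        for d, c in zip(docs, labels):
            margin = c * sum(b * x for b, x in zip(beta, d))
            # Gradient of log sigmoid(c * beta.d) with respect to beta.
            coeff = c * (1.0 - sigmoid(margin))
            for i, x in enumerate(d):
                grad[i] += coeff * x
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

def predict_proba(beta, d):
    """Estimated probability that document d belongs to the category."""
    return sigmoid(sum(b * x for b, x in zip(beta, d)))
```

A smaller `tau` means a tighter prior and therefore stronger shrinkage of the parameters.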
c.
.
,
.
(DT)
,
, . DT
,
. .
DT ,
.
DT-
ID3, C4.5, CART.
: f ,
,
.
. .
information gain .
(
) .
,
.
( )
.
DT
.
.1
d.
(Decision rule classifiers-DR)
.
DNF CONSTRUE
.
(
) . DNF
- .
.
$d_1 \wedge d_2 \wedge \dots \wedge d_n \rightarrow c$,
where $d_i$ are the features of the example and $c$ is its category.
(.,
),
, .
,
.
.
,
, .
RIPPER(repeated incremental pruning to produce error reduction)
(Cohen,
1995a;1995b; Cohen & Singer 1996). Ripper
.
Ripper
loss ratio ,
.
e.
F
.
. ,
.
,
, (real-value) .
(linear least
square fit-LLSF), Yang and
Chute(1994). ,
| C | | F | ,
.
. LLSF
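$$M = \arg\min_M \lVert MD - O \rVert_F,$$
where $D$ is the $|F| \times |TrainingSet|$ matrix of the training-document representations, $O$ is the $|C| \times |TrainingSet|$ matrix of the true category assignments, and $\lVert \cdot \rVert_F$ is the Frobenius norm (Weisstein):
$$\lVert A \rVert_F = \sqrt{\sum_{i,j} A_{ij}^2}$$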
. , mij
i- -
.
f. Rocchio
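The Rocchio classifier represents each category by a prototype (profile) vector. The profile of category $c$ is a vector $(w_1, w_2, \dots)$ computed as
$$w_i = \frac{\alpha}{|POS(c)|} \sum_{d \in POS(c)} w_{di} \;-\; \frac{\beta}{|NEG(c)|} \sum_{d \in NEG(c)} w_{di},$$
where $POS(c)$ and $NEG(c)$ are the sets of training documents that do and do not belong to category $c$, and $w_{di}$ is the weight of the $i$-th feature in document $d$. Positive examples are usually considered far more important than negative ones, so $\alpha \gg \beta$. With $\beta = 0$ the profile is simply the centroid of the positive examples.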
Rocchio ,
. .
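The Rocchio profile is a weighted difference of two averages, which the following sketch computes directly. It assumes documents are equal-length lists of feature weights. The default values `alpha=1.0` and `beta=0.25` and the use of cosine similarity to score a new document against the profile are illustrative choices, not values taken from the text.

```python
import math

def rocchio_profile(pos_docs, neg_docs, alpha=1.0, beta=0.25):
    """Profile w_i = alpha * mean over POS(c) - beta * mean over NEG(c)."""
    profile = []
    for i in range(len(pos_docs[0])):
        pos_mean = sum(d[i] for d in pos_docs) / len(pos_docs)
        neg_mean = sum(d[i] for d in neg_docs) / len(neg_docs) if neg_docs else 0.0
        profile.append(alpha * pos_mean - beta * neg_mean)
    return profile

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Calling `rocchio_profile(pos, neg, beta=0.0)` reduces the profile to the centroid of the positive examples, and a new document can be scored with `cosine(profile, doc)`.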
g.
.
,
CSV ,
(.
). ,
;
.
backpropagation2,
.
( ),
.
.
, .
.
.
2
Backpropagation
. 1974 Paul
Werbos. .
, . .
h.
. lazy
learners,
.
.
kNN (-
). d c , kNN
k
c.
, ;, .
Distance-weighted kNN
(
) .
k.
,
Larkey and Croft (Larkey and Croft 1996) use k = 20, while Yang (Yang 2001) found 30 ≤ k ≤ 45 to be most effective.
k
, .
kNN
.
, .
.
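The distance-weighted variant of kNN mentioned above can be sketched in a few lines. The version below assumes cosine similarity as the closeness measure and lets each of the k nearest training documents vote for its category with a weight equal to its similarity. The names `knn_classify` and `cosine` and the default `k=3` are illustrative. The values of k reported in the literature are much larger because real training sets are much larger.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(train, doc, k=3):
    """train: list of (feature_vector, category) pairs."""
    # No model is built in advance: all work happens at classification time.
    neighbours = sorted(train, key=lambda item: cosine(item[0], doc), reverse=True)[:k]
    votes = defaultdict(float)
    for vec, category in neighbours:
        votes[category] += cosine(vec, doc)  # closer neighbours count for more
    return max(votes, key=votes.get)
```

Because every classification scans the whole training set, the cost per document grows with the size of the training collection.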
i.
(SVM)
.
SVM -
,
.
,
, .
, .
SVM
, .
. SVM
SVM
,
.
.2 SVM
j. : Bagging and Boosting
,
.
bagging ,
.
,
.
k
.
,
.
( k +1) / 2
( k ). ,
, ,
k-
.
.
.
Boosting
. bagging
. i-
,
. AdaBoost
.
:
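Let the training data be $D = \{(d_1, c_1), (d_2, c_2), \dots\}$, where $d_i \in X$ are the training documents and $c_i \in \{+1, -1\}$ is the category membership of $d_i$. A weak learner is an algorithm that, given $D$ and a weight distribution $W$ over it, produces a weak hypothesis (classifier) $h : X \rightarrow \{\pm 1\}$. The goodness of a hypothesis is measured by its error
$$\varepsilon(h, W) = \sum_{i:\, h(d_i) \neq c_i} W(i),$$
the sum of the weights of the misclassified documents.
The AdaBoost algorithm:
- Initialize the weights $W_1(i) = 1 / |D|$ for all $i$.
- Repeat for $t = 1, \dots, k$:
    - Train a weak classifier $h_t$ using the current weights $W_t$.
    - Let $\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon(h_t, W_t)}{\varepsilon(h_t, W_t)}$.
    - Update the weights:
      $$W_{t+1}(i) = Z_t \cdot W_t(i) \cdot \begin{cases} \exp(-\alpha_t), & \text{if } h_t(d_i) = c_i, \\ \exp(\alpha_t), & \text{otherwise,} \end{cases}$$
      where $Z_t$ is the normalization factor chosen so that $\sum_i W_{t+1}(i) = 1$.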
.,
.
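The AdaBoost loop can be made concrete with a very simple weak learner. The sketch below assumes single-feature threshold rules (decision stumps) as weak classifiers and combines them with the standard AdaBoost rule, the sign of the alpha-weighted sum of the weak predictions. The stump learner, the function names, and the clamping of the error away from 0 and 1 are choices made for this illustration.

```python
import math

def train_stump(docs, labels, weights):
    """Weak learner: the single-feature threshold rule with the lowest weighted error."""
    best = None
    for i in range(len(docs[0])):
        for thr in sorted({d[i] for d in docs}):
            for sign in (+1, -1):
                preds = [sign if d[i] >= thr else -sign for d in docs]
                err = sum(w for w, p, c in zip(weights, preds, labels) if p != c)
                if best is None or err < best[0]:
                    best = (err, i, thr, sign)
    return best

def stump_predict(stump, d):
    _, i, thr, sign = stump
    return sign if d[i] >= thr else -sign

def adaboost(docs, labels, k=10):
    """Run k boosting rounds and return a list of (alpha_t, h_t) pairs."""
    n = len(docs)
    weights = [1.0 / n] * n  # W_1(i) = 1 / |D|
    ensemble = []
    for _ in range(k):
        stump = train_stump(docs, labels, weights)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Shrink the weights of correctly classified documents, grow the others.
        weights = [
            w * math.exp(-alpha if stump_predict(stump, d) == c else alpha)
            for w, d, c in zip(weights, docs, labels)
        ]
        total = sum(weights)  # normalization so the weights sum to 1
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, d):
    """Weighted vote of the weak classifiers."""
    score = sum(alpha * stump_predict(stump, d) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

On a toy set such as `docs = [[1], [2], [3]]` with `labels = [1, -1, 1]`, no single stump classifies every document correctly, but three boosting rounds are enough for the combined classifier to do so.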
VII.
ML
.
,
,
. ,
,
.
. ()
.
EM Naïve Bayes.
. EM ,
:
-:
-:
-,
-, .
.
,
, . ,
.
,
.
( )
, .
( 60%)
.
VIII.
a.
IR(Information Retrieval)
recall precision. Recall
, precision
.
recall
precision
.
,
breakeven() , recall
precision .
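Recall, precision, and the breakeven point can be illustrated with a short sketch. It assumes the usual definitions: precision is the fraction of documents assigned to a category that really belong to it, and recall is the fraction of the category's documents that the classifier found. The breakeven point is approximated by sweeping a cut-off down a ranking of documents by their CSV score and taking the position where the two measures are closest. The function names and the dictionary-based input are illustrative.

```python
def precision_recall(predicted, actual):
    """predicted, actual: sets of document ids for one category."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

def breakeven(scores, actual):
    """scores: dict mapping document id to CSV score; actual: set of relevant ids."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = None
    for k in range(1, len(ranked) + 1):
        # Assign the k highest-scoring documents to the category.
        p, r = precision_recall(set(ranked[:k]), actual)
        gap = abs(p - r)
        if best is None or gap < best[0]:
            best = (gap, (p + r) / 2)
    return best[1]
```

Lowering the cut-off raises recall and usually lowers precision, which is why a single number such as the breakeven point is convenient for comparing classifiers.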
b.
Reuters
, -
.
.
.
:
1)
( )
.
2) .
3) ,
. ,
,
, ,
.
, a .
,
.
OHSUMED
SH , 20 -
-, , TREC-AP
.
c.
,
.
SVM.
Liu, H., Li, J., & Wong, L. (n.d.). A Comparative Study on Feature Selection and Classification Methods. Laboratories for Information Technology, 21 Heng Mui Keng Terr, 119613 Singapore.
Moore, A. (2003). Information Gain. Carnegie Mellon University.
Rule of thumb. (n.d.). 2 2, 2009, Wikipedia: http://en.wikipedia.org/wiki/Rule_of_thumb
Support Vector Machine. (n.d.). February 6, 2009, Wikipedia: http://en.wikipedia.org/wiki/Support_vector_machine
Tf-idf weighting. (n.d.). February 8, 2008, http://nlp.stanford.edu/IRbook/html/htmledition/tf-idf-weighting-1.html
Cohen, W. (n.d.). Text Classification: Advanced Tutorial. February 5, 2009, VideoLectures.net: http://videolectures.net/mlas06_cohen_tc/
Weisstein, E. W. (n.d.). Chi-Squared Distribution. January 25, 2009, WolframMathWorld: http://mathworld.wolfram.com/Chi-SquaredDistribution.html
Wikipedia. (1997). BackPropagation. http://en.wikipedia.org: http://en.wikipedia.org/wiki/BackPropagation
Ye, N. (2003). Handbook of Data Mining. Mahwah, New Jersey; London: Lawrence Erlbaum Associates.
Feldman, R., & Sanger, J. Classification, Algorithm Analysis. In R. Feldman, & J. Sanger, Text Mining Handbook. Cambridge.
Maron, M. E. (1960). Probabilistic Indexing and Information Retrieval. Journal of ACM.