Text Mining

2009

I. Introduction
II. Applications of Text Categorization
a. Indexing of documents
b. Document sorting and text filtering
c. Hierarchical categorization of Web pages
III. Definition of the Problem
a. Single-label versus multilabel categorization
b. Document-pivoted versus category-pivoted categorization
c. Hard versus soft categorization
IV. Document Representation
a. Feature selection
b. Dimensionality reduction by feature extraction
V. Knowledge Engineering Approach
VI. Machine Learning Approach
a. Probabilistic classifiers
b. Bayesian logistic regression
c. Decision tree classifiers
d. Decision rule classifiers
e. Regression methods
f. The Rocchio method
g. Neural networks
h. Example-based classifiers
i. Support vector machines
j. Classifier committees: Bagging and Boosting
VII. Using Unlabeled Data to Improve Classification
VIII. Evaluation of Text Classifiers
a. Performance measures
b. Benchmark collections
c. Comparison among classifiers
References

I. Introduction

As the volume of available textual information grows, so does the need to organize it. One of the central tasks of text mining is the automatic assignment of natural-language documents to a set of predefined thematic categories, known as text categorization (TC). The task can be stated simply: given a set of documents and a set of categories, assign to each document the categories to which it belongs.

Research on the problem dates back to the early 1960s (Maron 1960), when it was studied in the context of probabilistic indexing. Interest grew considerably in the 1990s with the rapidly increasing availability of documents in electronic form and the need to organize them. Today text categorization is applied in many contexts: automatic indexing of documents, filtering of e-mail and spam, categorization of Web pages, and routing of news articles, among others.

Until the late 1980s the dominant approach to the problem was knowledge engineering: a set of classification rules is defined manually by domain experts. Since the 1990s the machine learning (ML) approach has prevailed: a general inductive process automatically builds a classifier by learning from a set of preclassified training documents. The ML approach achieves accuracy comparable to that of human experts while requiring far less skilled labor, since manual effort is needed only to label training documents rather than to construct and maintain the classifier itself.

This paper surveys the text categorization problem: its applications, the common document representations, the principal learning algorithms, and the ways text classifiers are evaluated and compared.

II. Applications of Text Categorization

Text categorization arises in many applications, from classical library indexing to the organization of the Web. Three of the most common application settings are described below: automatic indexing of documents, document sorting and text filtering, and hierarchical categorization of Web pages.
a. Indexing of documents
The first use of text categorization was automatic indexing for information retrieval (IR) systems. In such systems each document is assigned one or more key terms describing its content, drawn from a controlled vocabulary such as a thesaurus or a subject hierarchy; examples include the NASA aerospace thesaurus and the MeSH vocabulary used for medical articles. Queries over the collection are then matched against the assigned terms. If the terms of the controlled vocabulary are viewed as categories, then document indexing is an instance of text categorization: each document is assigned a small number k of terms from the vocabulary. Indexing is traditionally performed by trained human indexers, which is expensive and slow, so automating it is attractive. The same techniques also apply to related tasks, such as the automatic generation of document metadata and the extraction of keywords for summarization.
b. Document sorting and text filtering
Another common application is sorting a collection of documents into several bins. For example, e-mail messages arriving at a company may be sorted into categories such as complaints, deals, and job applications; classified ads in an online newspaper must be placed under the appropriate heading. Such tasks usually impose additional constraints: typically each document must be assigned to exactly one category, and the decision must often be made online, as documents arrive.

Text filtering is the special case of sorting with only two bins: relevant and irrelevant documents. A filtering system sits between a producer of documents and a consumer, blocking the documents the consumer is unlikely to want. A typical example is a news feed, where the producer is a news agency and the consumer is a newspaper interested only in certain topics. A filter can make two kinds of mistakes:
- recall errors: a relevant document is mistakenly blocked;
- precision errors: an irrelevant document is mistakenly let through.
Which kind of error matters more depends on the application. In e-mail filtering, for instance, it is usually more important to avoid recall errors (example: a Spam filter should rather let some Spam mail through than mark a legitimate message as Spam). The decision threshold of the filter can be tuned to trade one kind of error against the other.
c. Hierarchical categorization of Web pages
A third application is the automatic classification of Web pages into hierarchical catalogues such as the one maintained by Yahoo. Such catalogues let users browse a hierarchy of categories instead of issuing a query, which is convenient when the information need cannot easily be expressed as a few keywords. Manually maintaining a catalogue at the scale of the Web is infeasible, so automatic or semi-automatic classification of new pages is needed. Compared with plain text categorization, this setting has two distinctive features. First, the documents are hypertextual: the links pointing to and from a page carry valuable information about its topic, since linked pages tend to be topically related. Second, the category set is not fixed: the hierarchy changes over time as categories are added, removed, split, and merged.

III. Definition of the Problem

The general text categorization task can be formally defined as the task of approximating an unknown category assignment function F : D × C → {0, 1}, where D is the set of all possible documents and C is the set of predefined categories. The value of F(d, c) is 1 if document d belongs to category c and 0 otherwise. The classifier is a function M : D × C → {0, 1} produced by the learning algorithm, which should coincide with F as closely as possible.
a. Single-label versus multilabel categorization
Depending on the properties of F, several variants of the task can be distinguished. In multilabel categorization the categories may overlap, and a document may belong to any number of them. In single-label categorization every document belongs to exactly one category. Binary categorization is the special case of single-label categorization with exactly two categories. The binary case is the most important one, both because many applications are naturally binary (e.g., relevant versus irrelevant) and because the general multilabel problem over a category set C can be decomposed into |C| independent binary problems (where |C| is the number of categories): for each category, decide whether a given document belongs to it or not.

b. Document-pivoted versus category-pivoted categorization
A classifier fills a matrix of decisions with one row per document and one column per category, and this matrix can be filled in two ways: row by row or column by column. In document-pivoted categorization, a given document is matched against all categories. In category-pivoted categorization, all documents are matched against a given category. The distinction matters mainly when not all documents or not all categories are available from the start. Document-pivoted categorization suits online settings in which documents arrive one at a time; category-pivoted categorization suits settings in which new categories may be added after some documents have already been classified.
c. Hard versus soft categorization
A fully automated categorization system must make a hard, binary decision for every document-category pair. When the system merely supports a human expert who makes the final decision, a softer form of output is more useful: for a given document the system can produce a ranking of the candidate categories. Most classifiers naturally produce a real value in the interval [0, 1] for each document-category pair, called the categorization status value (CSV). Hard categorization is then obtained by thresholding: the document is assigned to the category if its CSV exceeds a chosen threshold. The ranking induced by the CSV is valuable in semi-automated settings, because the expert only needs to inspect the top k candidate categories instead of all of them. Different classifiers compute the CSV in different ways, depending on the underlying model (e.g., as a probability or as a distance from a separating surface).
IV. Document Representation

Learning algorithms cannot work directly with raw text, so documents must first be converted into a manageable representation. Typically, each document is represented by a vector of features (a feature vector). The most common features are word-based. The bag-of-words representation keeps, for each document, the multiset of words occurring in it, ignoring word order and all grammatical structure. A feature vector then has one component per word of the vocabulary. In the simplest, binary representation a component is 1 if the word occurs in the document and 0 otherwise. More refined representations take the frequency of the word into account. A widely used weighting scheme is TF-IDF, which assigns word w in document d the weight

TFIDF_Weight(w, d) = TermFreq(w, d) · log( N / DocFreq(w) ),

where TermFreq(w, d) is the frequency of the word in the document, N is the total number of documents, and DocFreq(w) is the number of documents containing the word w.
a. Feature selection
The number of different words in even a moderately sized document collection is very large, so the dimensionality of the bag-of-words representation can easily reach tens or hundreds of thousands. Many learning algorithms cannot cope with so many features, and processing becomes expensive, so the dimensionality must be reduced. Feature selection keeps a subset of the original features and discards the rest. It works because most of the words are irrelevant to the categorization task: classification quality often does not suffer, and may even improve, when they are removed.

The simplest technique is the removal of stop words: function words and other very common words that carry no topical information. Another simple and effective criterion is the document frequency DocFreq(w). Experiments in IR (Information Retrieval) and text categorization show that a large majority of the features, often 90% or more, can be removed by frequency-based criteria without degrading quality: very rare words (for example, words appearing in fewer than a handful of documents) are statistically unreliable, while extremely frequent ones do not discriminate between categories.

More sophisticated measures of feature relevance also exist. One of the most popular is information gain:

IG(w) = Σ_{c ∈ C} Σ_{f ∈ {w, ¬w}} P(f, c) · log( P(c | f) / P(c) ),

which measures how much knowing the presence or absence f of the word w reduces the uncertainty about the category c. The features with the highest information gain are kept, and the rest are discarded.
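The information gain formula above can be estimated directly from document counts. The following sketch uses a toy, hand-made dataset (documents as sets of words, labels as category names); the probabilities are maximum-likelihood estimates from the counts.

```python
import math

def information_gain(docs, labels, word, categories):
    """IG(w) = sum over c and f in {w, not-w} of P(f, c) * log(P(c|f) / P(c)),
    using log(P(f,c) / (P(f) P(c))) = log(P(c|f) / P(c))."""
    n = len(docs)
    ig = 0.0
    for c in categories:
        p_c = sum(1 for l in labels if l == c) / n               # P(c)
        for present in (True, False):                            # f = w or not-w
            match = [l for d, l in zip(docs, labels) if (word in d) == present]
            p_f = len(match) / n                                 # P(f)
            p_fc = sum(1 for l in match if l == c) / n           # P(f, c)
            if p_fc > 0:
                ig += p_fc * math.log(p_fc / (p_f * p_c))
    return ig

# Toy data: "wheat" perfectly predicts category "W"; "the" occurs everywhere.
docs = [{"wheat", "the"}, {"wheat", "farm", "the"},
        {"corn", "the"}, {"corn", "export", "the"}]
labels = ["W", "W", "C", "C"]
```

A perfectly predictive word attains the full entropy of the category variable (log 2 here), while a word present in every document has zero information gain.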
. .
2
max
( f ) = max
cC

| Tr | ( P ( f , c ) P ( f , c ) P ( f , c ) P ( f , c )) 2
P ( f ) P ( f ) P (c ) P (c )


. (
)
100 ,
(Yang & Pedersen, 1997)
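The chi-square statistic above can likewise be estimated from counts. A minimal sketch on the same kind of toy data (all documents and words invented for illustration):

```python
def chi_square_max(docs, labels, word, categories):
    """chi2_max(f) = max over c of |Tr| * (P(f,c)P(~f,~c) - P(f,~c)P(~f,c))^2
    / (P(f) P(~f) P(c) P(~c)), with probabilities estimated from the counts."""
    n = len(docs)
    p_f = sum(1 for d in docs if word in d) / n
    best = 0.0
    for c in categories:
        p_c = sum(1 for l in labels if l == c) / n
        p_fc = sum(1 for d, l in zip(docs, labels) if word in d and l == c) / n
        p_fnc = p_f - p_fc                   # P(f, ~c)
        p_nfc = p_c - p_fc                   # P(~f, c)
        p_nfnc = 1 - p_f - p_c + p_fc        # P(~f, ~c)
        denom = p_f * (1 - p_f) * p_c * (1 - p_c)
        if denom > 0:
            best = max(best, n * (p_fc * p_nfnc - p_fnc * p_nfc) ** 2 / denom)
    return best

# Toy data: "wheat" perfectly predicts "W"; "the" occurs in every document.
docs = [{"wheat", "the"}, {"wheat", "farm", "the"},
        {"corn", "the"}, {"corn", "export", "the"}]
labels = ["W", "W", "C", "C"]
```

For a perfectly associated feature the statistic equals |Tr| (here 4), and a feature occurring in all documents gets 0 because it cannot discriminate.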

b. Dimensionality reduction by feature extraction
Instead of selecting a subset of the existing features, one can create a smaller set of new, synthetic features as functions of the original ones. One technique is term clustering, which groups together words with similar meanings (e.g., synonyms) and uses the groups, rather than the individual words, as features. A more systematic approach is latent semantic indexing (LSI), which applies a singular value decomposition to the term-document matrix and represents the documents in a lower-dimensional space of derived concepts. The compressed representation can capture relations between words, such as synonymy, that the original bag-of-words representation misses.
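The core of LSI is a truncated singular value decomposition. A minimal sketch with NumPy, on an invented toy term-document matrix; k, the number of retained concepts, is chosen arbitrarily here:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
A = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 2., 1.]])

# LSI: truncated SVD, A ~ U_k S_k V_k^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                      # number of latent "concepts"
doc_vectors = np.diag(s[:k]) @ Vt[:k]      # documents in the k-dim concept space
```

By the Eckart-Young theorem the rank-k truncation is the best rank-k approximation of A in the Frobenius norm, so keeping more singular values never increases the reconstruction error.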

V. Knowledge Engineering Approach

In the knowledge engineering approach, the classification rules are defined manually by knowledge engineers with the help of domain experts. The best-known example is the CONSTRUE system, built by the Carnegie Group for the Reuters news agency. A typical CONSTRUE rule has the form:

If DNF(formula) then category else ¬category

where the condition is a formula in disjunctive normal form. For example, a rule for the category Wheat might be:

If ((wheat & farm) or
    (wheat & commodity) or
    (bushels & export) or
    (wheat & tonnes) or
    (wheat & winter & ¬soft))
then Wheat
else ¬Wheat

A breakeven performance of 90% was reported for CONSTRUE on a small subset of the Reuters collection (723 documents). It is unclear, however, whether this result would scale to the full collection, and it has not been confirmed by independent experiments. The main drawback of the approach is the knowledge acquisition bottleneck: the rules must be built and maintained manually by experts, which requires a very large amount of highly skilled labor, and the effort must be repeated whenever the category set changes or the system is ported to a different domain.
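A CONSTRUE-style DNF rule can be expressed directly in code. The encoding below is a hypothetical illustration (not how CONSTRUE itself was implemented): each conjunct is a pair of word sets, the words that must be present and the words that must be absent, and a document is a set of its words.

```python
# The Wheat rule above as a DNF: each pair is (required words, forbidden words).
WHEAT_RULE = [
    ({"wheat", "farm"}, set()),
    ({"wheat", "commodity"}, set()),
    ({"bushels", "export"}, set()),
    ({"wheat", "tonnes"}, set()),
    ({"wheat", "winter"}, {"soft"}),   # wheat & winter & ~soft
]

def matches(rule, words):
    """True iff some conjunct has all its required words and none of its forbidden ones."""
    return any(pos <= words and not (neg & words) for pos, neg in rule)
```

For instance, a document containing "wheat" and "farm" is assigned Wheat, while one containing "wheat", "winter", and "soft" is not, because the only conjunct it could satisfy forbids "soft".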


VI. Machine Learning Approach

In the machine learning approach, the classifier is built automatically by a general inductive process that learns the characteristics of the categories from a set of preclassified training documents. This is an instance of supervised learning: the learner is "supervised" by the known category labels of the training documents. The advantages over knowledge engineering are comparable accuracy and a large saving of skilled labor, since no domain expertise is needed to build the classifier; human effort is required only to label the training documents, a much simpler task.

The quality of the learned classifier depends heavily on the quantity and quality of the training data: if the training set is small or unrepresentative, the classifier will generalize poorly. As a rough rule of thumb, at least several tens of training examples (a figure of about 30 is often cited) are needed per category, and more are needed when the categories must be distinguished from many similar ones. The rest of this section surveys the most prominent learning algorithms used for text categorization.
a. Probabilistic classifiers
A probabilistic classifier uses as the categorization status value CSV(d, c) the probability P(c | d) that document d belongs to category c, computed by Bayes' theorem:

P(c | d) = P(d | c) · P(c) / P(d).

The marginal probability P(d) need not be computed, because it is the same for all categories and does not affect their ranking. Estimating P(d | c) directly is infeasible, however, because the number of possible documents is enormous. It is therefore common to represent the document as a vector of features, d = (w1, w2, ...), and to make the naive assumption that the features are statistically independent given the category. Then

P(d | c) = Π_i P(wi | c),

and the individual probabilities P(wi | c) can be estimated from the training set. A classifier built on this assumption is called a Naive Bayes (NB) classifier. Although the independence assumption is clearly violated in natural language, Naive Bayes classifiers perform surprisingly well in practice; an analysis of why this is so is given by Domingos and Pazzani (1997).
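A minimal multinomial Naive Bayes classifier following the formulas above can be written from scratch. This is a sketch on invented toy data; add-one (Laplace) smoothing is used so that unseen words do not zero out the product, and logarithms replace the product to avoid underflow.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate P(c) and the counts behind P(w|c) from bags of words."""
    cat_count = Counter(labels)
    word_count = defaultdict(Counter)
    vocab = set()
    for d, c in zip(docs, labels):
        word_count[c].update(d)
        vocab.update(d)
    return cat_count, word_count, vocab, len(docs)

def classify_nb(doc, model):
    """Pick argmax_c P(c) * prod_i P(w_i | c), computed in log space."""
    cat_count, word_count, vocab, n = model
    best, best_score = None, -math.inf
    for c in cat_count:
        total = sum(word_count[c].values())
        score = math.log(cat_count[c] / n)                        # log P(c)
        for w in doc:                                             # log P(w|c), smoothed
            score += math.log((word_count[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy training set.
docs = [["wheat", "farm"], ["wheat", "tonnes"], ["stocks", "bank"], ["bank", "export"]]
labels = ["grain", "grain", "finance", "finance"]
model = train_nb(docs, labels)
```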
b. Bayesian logistic regression
The conditional probability P(c | d) can also be modeled directly. Bayesian logistic regression (BLR) is an old statistical technique that has only recently been applied to text categorization, with good results. Assuming the binary case with c ∈ {−1, +1}, the model is

P(c = +1 | d) = φ(β · d) = φ( Σ_i βi di ),

where d = (d1, d2, ...) is the feature vector of the document, β = (β1, β2, ...) is the vector of model parameters, and φ is the logistic link function:

φ(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(−x)).

In the Bayesian setting, a prior distribution is imposed on the parameters βi, expressing the belief that most of them should be close to zero. A common choice is a Gaussian prior with zero mean and variance τ:

p(βi | τ) = N(0, τ) = (1 / sqrt(2πτ)) · exp( −βi² / (2τ) ).

The prior acts as a regularizer and prevents overfitting the training data. With a Laplace prior instead of a Gaussian one, the maximum a posteriori solution is sparse: most of the parameters are exactly zero, so the prior performs feature selection as a side effect. The resulting classifiers are compact and are among the best-performing linear classifiers for text.
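The model can be fitted by maximizing the log-posterior. The sketch below is a deliberately simplified illustration, not a production BLR solver: it uses plain batch gradient ascent, labels encoded as {0, 1} rather than {−1, +1}, and the Gaussian prior appears as the ridge term −βi²/(2τ); all data is invented.

```python
import math

def phi(x):
    """Logistic link: phi(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def train_blr(X, y, tau=10.0, lr=0.1, epochs=200):
    """Gradient ascent on the log-posterior: logistic log-likelihood
    plus the Gaussian prior term -beta_i^2 / (2 * tau)."""
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [-b / tau for b in beta]              # gradient of the prior term
        for x, c in zip(X, y):                       # c in {0, 1} in this sketch
            p = phi(sum(b * xi for b, xi in zip(beta, x)))
            for i, xi in enumerate(x):
                grad[i] += (c - p) * xi              # gradient of the log-likelihood
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

# Toy, linearly separable data: the first feature indicates the positive class.
X = [[1.0, 0.0], [1.0, 0.2], [0.0, 1.0], [0.1, 1.0]]
y = [1, 1, 0, 0]
beta = train_blr(X, y)
```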
c. Decision tree classifiers
Many classifiers, such as the probabilistic ones above, are hard for humans to interpret. Decision tree (DT) classifiers do not have this drawback: the learned model is a tree whose internal nodes are labeled by features, whose edges are labeled by tests on the feature value (with the binary document representation, simply the presence or absence of the word), and whose leaves are labeled by categories. A document is classified by starting at the root and following the edges corresponding to its feature values until a leaf is reached; the label of the leaf is the predicted category. Most DT classifiers use the binary document representation and thus produce binary trees. Well-known DT induction algorithms include ID3, C4.5, and CART.

A tree is typically built by a divide-and-conquer strategy: select the feature f that best discriminates among the categories of the current set of training documents, most commonly the one with the highest information gain; split the set according to the value of f (documents containing the word versus documents not containing it); and recurse on each subset. The recursion stops when all documents in a subset belong to the same category, at which point a leaf labeled with that category is created. A fully grown tree tends to overfit the training data, so branches supported by too few examples are usually pruned away.

[Figure 1]
d. Decision rule classifiers
Decision rule (DR) classifiers are likewise easy for humans to understand. They are built by inductive rule learning and produce rule sets in disjunctive normal form, similar to the CONSTRUE rules described above, but generated automatically from the training data rather than written by experts. A rule has the form

d1 ∨ d2 ∨ … ∨ dn → c,

where each di is a conjunction of feature tests and c is a category. Rule learners typically work bottom-up: initially, every training document is turned into a maximally specific rule (the conjunction of all its features, implying its category); the rules are then repeatedly generalized, merged, and simplified, and finally pruned, trading a small loss of accuracy on the training set for better generalization and a more compact rule set.

One of the best-known rule learners used for text categorization is RIPPER (repeated incremental pruning to produce error reduction) (Cohen, 1995a; 1995b; Cohen & Singer, 1996). A useful feature of RIPPER is the ability to bias the learned rules by means of a loss ratio, which specifies the relative cost of false positives and false negatives and thereby lets the classifier be tuned toward recall or toward precision.
e. Regression methods
Regression methods approximate the target function F by a real-valued function instead of a binary one; the training labels are treated as (real-valued) outputs to be fitted. The technique most often used in text categorization is the linear least-squares fit (LLSF), introduced by Yang and Chute (1994). Each training document is represented by an input vector of |F| feature weights and an output vector of |C| category weights, and the classifier is a linear transformation, a matrix M of size |C| × |F|, mapping input vectors to output vectors. M is chosen to minimize the error on the training set:

M = arg min_M || MD − O ||_F,

where D is the |F| × |TrainingSet| matrix of input vectors, O is the |C| × |TrainingSet| matrix of the corresponding output vectors, and || · ||_F is the Frobenius norm (Weisstein):

|| A ||_F = sqrt( Σ_{i,j} a_ij² ).

The minimization can be computed via a singular value decomposition of D. An interesting property of the result is that the entry m_ij of M can be interpreted as the degree of association between the i-th category and the j-th feature.
f. The Rocchio method
The Rocchio method represents each category c by a prototype vector (w1, w2, ...), computed from the training documents as

wi = β / |POS(c)| · Σ_{d ∈ POS(c)} w_di  −  γ / |NEG(c)| · Σ_{d ∈ NEG(c)} w_di,

where POS(c) and NEG(c) are the sets of positive and negative training examples of category c, and w_di is the weight of the i-th feature in document d. A document is classified by comparing it (for example, by cosine similarity) with the prototype vectors of the categories. Usually the positive examples are considered far more important than the negative ones, so β >> γ; with γ = 0 the prototype is simply the centroid of the positive examples. The Rocchio classifier is very easy to implement and very cheap to train, but its performance is usually mediocre compared with the other methods described here.
g. Neural networks
A neural network classifier is a network of units. The input units receive the feature values of the document, the output units produce the categorization status values, one per category (alternatively, a separate network can be built for each category), and the connections between units carry weights that are learned from the training data. A network without hidden layers (a perceptron) implements a linear classifier; networks with one or more hidden layers can represent nonlinear decision surfaces. Networks are usually trained by backpropagation (2): each training document is fed to the network, and if it is misclassified, the error is propagated backward through the network and the connection weights are adjusted to reduce it.

The advantage of neural networks over linear methods is their ability to learn nonlinear dependencies. In text categorization, however, this extra power rarely pays off: experiments show that nonlinear networks offer little or no advantage over their much simpler linear counterparts, while their training is considerably more expensive.

(2) Backpropagation is the standard method for training multilayer neural networks, first described in 1974 by Paul Werbos.
h. Example-based classifiers
Example-based classifiers do not build an explicit model of the categories. Instead, they keep the training documents themselves and classify a new document by comparing it with them. For this reason they are called lazy learners: all the real work is postponed until classification time.

The most prominent example-based method is kNN (k-nearest neighbors). To decide whether a document d belongs to category c, kNN finds the k training documents most similar to d and checks what share of them belongs to c; if the share is large enough, the decision is positive. In the distance-weighted version of kNN, closer neighbors count more than distant ones, which makes the classifier less sensitive to the exact choice of k. The choice of k has been studied empirically: Larkey and Croft (Larkey and Croft 1996) used k = 20, while Yang (Yang 2001) found values 30 ≤ k ≤ 45 to work well; performance is robust over a wide range of k.

In terms of accuracy, kNN is among the best-performing text classifiers. Its drawback is efficiency: since no model is built in advance, the comparison with the training documents must be performed anew for every document being classified, which makes classification time high.
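A minimal kNN text classifier with cosine similarity and majority voting can be sketched as follows; the training vectors and labels are invented toy data:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def knn_classify(doc, train_docs, train_labels, k=3):
    """Assign the category held by the majority of the k most similar training docs."""
    neighbors = sorted(zip(train_docs, train_labels),
                       key=lambda dl: cosine(doc, dl[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy training data: two rough clusters along the two axes.
train_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["a", "a", "b", "b", "b"]
```

Note that all the work happens inside `knn_classify`; "training" consists of nothing more than storing the examples, which is exactly the lazy-learner behavior described above.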
i. Support vector machines
The support vector machine (SVM) algorithm is fast and effective for text categorization. An SVM views the documents as points in the feature space and finds the decision surface, in the linear case a hyperplane, that separates the positive from the negative training examples with the maximal margin: the distance from the surface to the nearest training examples is as large as possible. The training examples closest to the surface, which alone determine it, are called support vectors. For training sets that are not linearly separable, a soft-margin variant allows some examples to violate the margin at a cost.

A notable property of SVMs is their robustness in high-dimensional feature spaces: because the margin, rather than the dimensionality, controls the capacity of the classifier, SVMs resist overfitting and can work with the full feature set, without aggressive feature selection. The method also requires little parameter tuning, since theoretically well-motivated default settings exist. These properties make SVMs particularly well suited to text, and they consistently rank among the best text categorization methods.

[Figure 2: SVM]
j. Classifier committees: Bagging and Boosting
The idea behind classifier committees is that a team of classifiers can perform better than any of its individual members, provided their errors are at least partly independent. The decisions of the individual classifiers are combined, most simply by majority voting.

In bagging, k classifiers are trained in parallel on k different training sets, each obtained by sampling documents from the original training set uniformly at random with replacement (a bootstrap sample). The k classifiers then vote: a category is assigned to a document if and only if at least (k + 1)/2 of the classifiers (assuming an odd k) decide so. Because each classifier sees a slightly different training set, their errors tend to differ, and the majority vote averages them out. Bagging helps most with unstable learners, whose output changes considerably under small changes of the training data.
Boosting also builds k classifiers, but unlike bagging it trains them sequentially rather than in parallel. The i-th classifier is trained paying special attention to the training documents that were misclassified by the previous ones. The best-known boosting algorithm is AdaBoost, which can be described formally as follows.

Let D = {(d1, c1), (d2, c2), ...} be the training data, where the di ∈ X are the documents and the ci ∈ {+1, −1} are their labels. A weak learner is an algorithm that, given a weight distribution W over the training documents, produces a classifier (hypothesis) h : X → {−1, +1}. The error of a hypothesis under the distribution W is

ε(h, W) = Σ_{i : h(di) ≠ ci} W(i),

the total weight of the misclassified documents. AdaBoost proceeds as follows. The weights are initialized uniformly,

W1(i) = 1 / |D| for all i,

and then, for t = 1, ..., k, the weak learner is trained with the distribution Wt, producing the hypothesis ht; the coefficient

αt = (1/2) · log( (1 − ε(ht, Wt)) / ε(ht, Wt) )

is computed; and the weights are updated:

Wt+1(i) = Zt · Wt(i) · exp(−αt)  if ht(di) = ci,
Wt+1(i) = Zt · Wt(i) · exp(αt)   otherwise,

where Zt is a normalization constant chosen so that Σ_i Wt+1(i) = 1. Misclassified documents thus gain weight and correctly classified ones lose it, so the next hypothesis concentrates on the hard cases. The final classifier is the weighted vote of the k hypotheses:

H(d) = sign( Σ_{t=1..k} αt · ht(d) ).

Boosted committees of very simple weak hypotheses (e.g., single-feature tests) achieve excellent results in text categorization.

VII. Using Unlabeled Data to Improve Classification

The ML approach requires a sufficient amount of labeled training data, and labeling documents is expensive because it must be done by humans. Unlabeled documents, on the other hand, are usually plentiful and cheap to collect. It is therefore attractive to use unlabeled documents, in addition to a small labeled set, to improve the classifier. Two main approaches have been proposed.

The first approach combines the Expectation Maximization (EM) algorithm with a Naive Bayes classifier. The initial classifier is trained on the labeled documents alone, and then EM alternates between two steps until convergence:
- E-step: the current classifier assigns probabilistic (soft) category labels to the unlabeled documents;
- M-step: the classifier is retrained on the labeled documents together with the probabilistically labeled ones.

The second approach uses a transductive version of SVM, in which the maximal-margin hyperplane is chosen taking into account not only the labeled training documents but also the unlabeled documents to be classified, so that the margin is large with respect to both.

Experiments with both approaches showed significant improvements when only a small labeled set is available; in some reported experiments the benefit (in reduced labeling effort or reduced error) was as large as 60%.
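The E/M alternation can be illustrated with a deliberately simplified sketch: hard (most-likely) labels instead of probabilistic ones, and a nearest-centroid classifier standing in for Naive Bayes, so everything fits in a few lines. All data is invented; this shows the shape of the loop, not the full EM algorithm.

```python
def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(c) / len(vectors) for c in zip(*vectors)]

def nearest(x, cents):
    """Category whose centroid is closest (squared Euclidean distance)."""
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def em_hard(X_lab, y_lab, X_unlab, iters=5):
    """Hard-assignment E/M loop: the first M-step uses the labeled data only."""
    y_unlab = [None] * len(X_unlab)
    for _ in range(iters):
        # M-step: (re)train on the labeled docs plus the currently guessed ones.
        by_cat = {}
        guessed = [(x, g) for x, g in zip(X_unlab, y_unlab) if g is not None]
        for x, c in list(zip(X_lab, y_lab)) + guessed:
            by_cat.setdefault(c, []).append(x)
        cents = {c: centroid(vs) for c, vs in by_cat.items()}
        # E-step: give each unlabeled doc its most likely category.
        y_unlab = [nearest(x, cents) for x in X_unlab]
    return y_unlab

# One labeled document per category, three unlabeled documents.
y_unlab = em_hard([[0.0, 1.0], [1.0, 0.0]], ["b", "a"],
                  [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]])
```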

VIII. Evaluation of Text Classifiers

a. Performance measures
The most common performance measures are borrowed from IR (Information Retrieval): recall and precision. For a given category, recall is the fraction of the documents belonging to the category that the classifier actually assigns to it, while precision is the fraction of the documents assigned to the category that actually belong to it. There is a trade-off between the two: by lowering the decision threshold a classifier increases recall at the expense of precision, and vice versa. To characterize a classifier by a single number, the threshold can be adjusted to the breakeven point, at which recall equals precision, and that common value is reported.
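The two measures, and a simple threshold sweep for the breakeven point, can be sketched as follows (document IDs and CSV scores in the test are invented):

```python
def precision_recall(assigned, relevant):
    """assigned: set of docs the classifier put in the category;
    relevant: set of docs that truly belong to it."""
    tp = len(assigned & relevant)                      # true positives
    precision = tp / len(assigned) if assigned else 1.0
    recall = tp / len(relevant) if relevant else 1.0
    return precision, recall

def breakeven(scores, relevant):
    """Sweep the CSV threshold over the observed scores and return the
    (precision, recall) pair where the two are closest to each other."""
    best, best_gap = None, float("inf")
    for t in sorted(set(scores.values())):
        assigned = {d for d, s in scores.items() if s >= t}
        p, r = precision_recall(assigned, relevant)
        if abs(p - r) < best_gap:
            best, best_gap = (p, r), abs(p - r)
    return best
```

Sweeping the threshold traces out the precision-recall trade-off described above; the breakeven point is simply the place on that curve where the two values meet.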
b. Benchmark collections
The most widely used benchmark is the Reuters collection, a set of newswire stories labeled with economic categories. Unfortunately, the collection exists in several versions with different splits into training and test sets, which complicates the comparison of published results. For two experiments to be directly comparable, they must satisfy several conditions:
1) they must use the same collection (the same documents and the same set of categories);
2) they must use the same split into training and test sets;
3) they must use the same evaluation measures.
Published results often violate one or more of these conditions, and differences in preprocessing, in the subset of categories used, and in the treatment of documents belonging to no category further muddy the picture, so conclusions drawn from the results of different authors must be treated with caution.

Other frequently used collections include OHSUMED, consisting of titles and abstracts of medical articles categorized with MeSH terms; 20 Newsgroups, a set of messages posted to twenty Usenet discussion groups, labeled by group; and TREC-AP, a collection of AP newswire stories.
c. Comparison among classifiers
Given the evaluation difficulties described above, any comparison among classifiers must be taken cautiously. Nevertheless, some consistent trends emerge from published experiments. The best performers are SVM, AdaBoost, kNN, and regression methods, and no conclusive winner can be identified among these four. Neural networks and decision trees usually perform slightly worse. The lowest performance is typically shown by the Rocchio method and Naive Bayes, although NB remains popular because it is simple, fast, and easy to implement, and it performs well in particular settings such as filtering tasks. The best published results on the Reuters benchmark have been obtained with SVM-based classifiers.

References

Liu, H., Li, J., & Wong, L. (n.d.). A Comparative Study on Feature Selection and Classification Methods. Laboratories for Information Technology, 21 Heng Mui Keng Terr, 119613 Singapore.
Moore, A. (2003). Information Gain. Carnegie Mellon University.
Rule of thumb. (n.d.). Retrieved February 2, 2009, from Wikipedia: http://en.wikipedia.org/wiki/Rule_of_thumb
Support Vector Machine. (n.d.). Retrieved February 6, 2009, from Wikipedia: http://en.wikipedia.org/wiki/Support_vector_machine
Tf-idf weighting. (n.d.). Retrieved February 8, 2008, from http://nlp.stanford.edu/IRbook/html/htmledition/tf-idf-weighting-1.html
Cohen, W. (n.d.). Text Classification: Advanced Tutorial. Retrieved February 5, 2009, from VideoLectures.net: http://videolectures.net/mlas06_cohen_tc/
Weisstein, E. W. (n.d.). Chi-Squared Distribution. Retrieved January 25, 2009, from WolframMathWorld: http://mathworld.wolfram.com/Chi-SquaredDistribution.html
Wikipedia. (n.d.). BackPropagation. Retrieved from http://en.wikipedia.org/wiki/BackPropagation
Ye, N. (2003). Handbook of Data Mining. Mahwah, New Jersey / London: Lawrence Erlbaum Associates.
Feldman, R., & Sanger, J. (n.d.). Classification: Algorithm Analysis. In R. Feldman & J. Sanger, The Text Mining Handbook. Cambridge.
Maron, M. (1960). Probabilistic Indexing and Information Retrieval. Journal of the ACM.