
Ivaylo Kehayov (268)

2011




. . ,
, .

. ,
.
, .
,

/ .
. , . , ,
.

, . ,
. ,
.

Ivaylo Kehayov
2011


1

, 1994 ( 1998) ,
.
.
Flesh Reading Ease . ,
,
.
Flesh Reading Ease .
, (data set) .

1.1
, .
,
, ,
, .

.
, , .

1.1.1
(text mining).
, . :

:

. , .

:
PubGene .

: , IBM Microsoft,
.

: ,
Tribune Company, ,
.

: .

: ,
, .

:

.
, (.. ) . ,
Nature,

.


,
,
. , ,
, . ,
spam e-mail .

1.1.2

.
.

,
.
Text classification (or categorization), also called document classification, is the task of assigning documents to predefined categories. Three broad settings are distinguished:

Supervised classification, where the correct categories of the training documents are known in advance and guide the learning of the classifier.

Unsupervised classification (clustering), where no category labels are available and documents are grouped by similarity alone.

Semi-supervised classification, where a small amount of labeled data is combined with a larger amount of unlabeled data.

( Flesh Reading Ease)


The documents used are collectively called a data set (also a collection or corpus). The data set is split into two parts: a training set and a test set. A common split assigns 70% of the documents to the training set and the remaining 30% to the test set.


.
. , ( ),
, . ,
( , ). , . , training set test set . training
set ground truth .
Widely used techniques for text classification include: naive Bayes, tf-idf weighting, Latent Semantic Indexing (LSI), Support Vector Machines (SVM), k-nearest neighbours (kNN), decision trees (such as ID3 and C4.5), and concept mining.

1.1.3
, . , , , . (term frequency vector).
, , ,
(dictionary).
( )
training set.
. ,
Each position of the vector holds the number of times the corresponding dictionary term appears in the document. For example, if the term automobile occurs 7 times in a document, the corresponding entry of that document's vector is 7; if it does not occur at all, the entry is 0.
. ,
. , , :
1: The Sun is a star.
2: The Earth is a planet.
3: The Earth is smaller than the Sun.
( , ):
= [a, Earth, is, planet, smaller, star, Sun, than, the]
, 9.
:
1 = [1, 0, 1, 0, 0, 1, 1, 0, 1]
2 = [1, 1, 1, 1, 0, 0, 0, 0, 1]
3 = [0, 1, 1, 0, 1, 0, 1, 1, 2]
Most entries are 0 or 1; the entry for the in vector 3 is 2, because the word the appears twice in the third sentence.
Note that words such as a, is and the appear in almost every document. Such frequent, low-information words are called stop words; they contribute little to distinguishing documents and are usually removed before the vectors are built, which also reduces the dimensionality of the representation.
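The construction just described can be sketched in a few lines. The helper names are illustrative, but the dictionary and the vectors are exactly those of the example above (stop-word removal is omitted for clarity):

```python
# Build the dictionary (sorted vocabulary) and the term-frequency vectors
# for the three example sentences. Tokenization is a simple lowercase split.
def build_dictionary(docs):
    terms = set()
    for doc in docs:
        terms.update(doc.lower().replace(".", "").split())
    return sorted(terms)

def term_frequency_vector(doc, dictionary):
    words = doc.lower().replace(".", "").split()
    return [words.count(term) for term in dictionary]

docs = [
    "The Sun is a star.",
    "The Earth is a planet.",
    "The Earth is smaller than the Sun.",
]
dictionary = build_dictionary(docs)          # 9 terms: a, earth, is, ...
vectors = [term_frequency_vector(d, dictionary) for d in docs]
```

Running this reproduces the vectors listed above, including the entry 2 for the word the in the third document.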


( 2) .

1.2
: (related work),
SVD. SVD,
.
, Reuters-21578.

. .
,
.


2

.
,
SVD-Aggregation.

2.1 Flesh
Flesh
. Flesh, Flesh Reading Ease Flesh-Kincaid Grade Level.
,
, , . ,
Flesh Reading Ease
Flesh-Kincaid Grade Level.
Rudolf Flesh ( J. Peter Kincaid),
.
Flesh-Kincaid Grade Level 1975
J. Peter Kincaid Rudolf
Flesh. Flesh Reading Ease 1978
.


Flesh Reading Ease assigns higher scores to easier text. The score is computed from the average sentence length (total words divided by total sentences) and the average number of syllables per word (total syllables divided by total words):

Flesh Reading Ease = 206.835 - 1.015 × (total words / total sentences) - 84.6 × (total syllables / total words)

Its theoretical maximum is about 120, but in practice scores fall between 0 and 100 and are interpreted with the bands shown below.

Table 1: Flesh Reading Ease score bands

Score      Reading difficulty
90-100     Very easy
80-89      Easy
70-79      Fairly easy
60-69      Standard
50-59      Fairly difficult
30-49      Difficult
0-29       Very confusing

Table 2: Flesh Reading Ease scores and the readers who can follow them

Score      Understandable by
90-100     An average 11-year-old student
60-70      13- to 15-year-old students
0-30       University graduates

As an example, consider the following excerpt from the Wikipedia article on Alan Turing (http://en.wikipedia.org/wiki/Alan_Turing):


Alan Mathison Turing, OBE, FRS (pronounced TEWR-ing; 23 June 1912 - 7 June 1954), was an English mathematician, logician, cryptanalyst and computer scientist. He was highly influential in the development of computer science, providing a formalization of the concepts of "algorithm" and "computation" with the Turing machine, which played a significant role in the creation of the modern computer. During the Second World War, Turing worked for the Government Code and Cypher School at Bletchley Park, Britain's code breaking center. For a time he was head of Hut 8, the section responsible for German naval cryptanalysis. He devised a number of techniques for breaking German ciphers, including the method of the bombe, an electromechanical machine that could find settings for the Enigma machine. After the war he worked at the National Physical Laboratory, where he created one of the first designs for a stored-program computer, the ACE.
Syllables: 256
Words: 144
Sentences: 7

Substituting these counts into the Flesh Reading Ease formula gives:

Flesh Reading Ease = 206.835 - 1.015 × (144 / 7) - 84.6 × (256 / 144) = 35.56
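The computation can be reproduced directly. The counts (144 words, 7 sentences, 256 syllables) are those reported for the Turing excerpt; syllable counting itself is non-trivial and is assumed to be done by an external routine:

```python
# The Flesh (Flesch) Reading Ease formula applied to the Turing excerpt.
def flesch_reading_ease(words, sentences, syllables):
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

score = flesch_reading_ease(144, 7, 256)
print(round(score, 2))  # close to the 35.56 reported above
```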


.
Microsoft Office Word, KWord,
WordPro, IBM Lotus Symphony, Abiword WordPerfect.

Flesh Reading Ease. , , Flesh
Reading Ease Flesh.

2.2
Cosine similarity measures the similarity of two documents as the cosine of the angle between their term vectors. Its value ranges from -1 to 1: vectors pointing in the same direction (angle 0°) have similarity 1, orthogonal vectors (angle 90°) have similarity 0, and vectors pointing in opposite directions (angle 180°) have similarity -1. Because term-frequency vectors contain no negative components, the angle between two of them never exceeds 90°; in practice the similarity of two documents therefore lies between 0 and 1, with larger values indicating greater similarity.
For two vectors A and B the cosine similarity is:

cos(A, B) = (A · B) / (||A|| · ||B||)    (Equation 1)

To illustrate, consider again the three example documents:
1: The Sun is a star.
2: The Earth is a planet.
3: The Earth is smaller than the Sun.
:
= [a, Earth, is, planet, smaller, star, Sun, than, the]
:
1 = [1, 0, 1, 0, 0, 1, 1, 0, 1]
2 = [1, 1, 1, 1, 0, 0, 0, 0, 1]
3 = [0, 1, 1, 0, 1, 0, 1, 1, 2]

Suppose a new document arrives:
4: The Sun is not a planet.
Its term-frequency vector over the same dictionary is (the word not is absent from the dictionary and is therefore ignored):
4 = [1, 0, 1, 1, 0, 0, 1, 0, 1]
training set test set.
.
. , , . ,
,
. :
cos(1, 4) = 0.80

cos(2, 4) = 0.80

cos(3, 4) = 0.60

Document 4 is therefore more similar to documents 1 and 2 than to document 3, and is classified with them.
: test set
training set ( ). test train .
training . ,
training 1, test . ,
1, 2 3,

. test set.
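The comparison of the new document against the training documents can be sketched as follows; the vectors are those of the example above:

```python
import math

# Cosine similarity between term-frequency vectors (Equation 1).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [1, 0, 1, 0, 0, 1, 1, 0, 1]  # "The Sun is a star."
d2 = [1, 1, 1, 1, 0, 0, 0, 0, 1]  # "The Earth is a planet."
d3 = [0, 1, 1, 0, 1, 0, 1, 1, 2]  # "The Earth is smaller than the Sun."
d4 = [1, 0, 1, 1, 0, 0, 1, 0, 1]  # "The Sun is not a planet." (new doc)

sims = [cosine(d, d4) for d in (d1, d2, d3)]
# doc 4 is more similar to docs 1 and 2 (0.80 each) than to doc 3 (~0.60)
```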

2.3 SVD Aggregation


Singular Value Decomposition (Aggregation) / oy . SVD , .
, training set
test set. training set
( 3).
3:

1
2

training set
.

SVD

U .
. ( )
.
i

(1im) , (

) U .

m -

. 4 ( 0 ).
4:

1
1

x ,

.
, (aggregated matrix)

D . , , ,
. . M D. ,
M


D, M D
. , M (
D ) () U.
test set training set. test m:

V S S ( ).
S :

c, c S. , doc_new
U. U
training set. .
- , M
( )
.
.
.
, . 5
. , . ,
1, 2.
SVD ,
,

3).


5:

10

11

12

10

, 75%
. 25% . S , ( ) . , . 75%
.
S U V.
U, S


, . 1010 . :

. -

. ,
M:

Table 6:

        E1     E2     E3     E4     E5     E6     E7     E8     E9     E10
E1     0.00  -0.79  -0.14  -0.60   0.21  -1.88   0.36  -0.46   0.30  -0.71
E2    -0.79   0.00  -1.86   0.81  -0.35  -1.20   0.81  -0.32   1.52   0.01
E3    -0.14  -1.86   0.00   1.45   0.27  -0.65   0.84   0.20   2.07   0.65
E4    -0.60   0.81   1.45   0.00   0.34  -0.27   0.89  -0.33   0.12  -0.95
E5     0.21  -0.35   0.27   0.34   0.00   0.31   1.36  -0.45   1.50  -0.39
E6    -1.88  -1.20  -0.65  -0.27   0.31   0.00   0.23  -0.36   0.21  -0.69
E7     0.36   0.81   0.84   0.89   1.36   0.23   0.00  -0.69   0.23  -0.09
E8    -0.46  -0.32   0.20  -0.33  -0.45  -0.36  -0.69   0.00   0.47  -1.62
E9     0.30   1.52   2.07   0.12   1.50   0.21   0.23   0.47   0.00  -0.36
E10   -0.71   0.01   0.65  -0.95  -0.39  -0.69  -0.09  -1.62  -0.36   0.00

The matrix is symmetric and its diagonal entries are 0.

.
. , , ,
.
, test,
:

,
. , :

test .
training U. , -


U.
.

The resulting similarities between the new document and the ten training documents are:

Table 7:

Document    Similarity
1           0.4667
2           0.2153
3           0.0500
4           0.6460
5           0.0165
6           0.5662
7           0.2792
8           0.3283
9           0.7906
10          0.7395

9.
, , . ( 0 ) 0.3091, 0.1200, 0.2141, 0.2338 -0.3612.
1, 4, 6, 7 10. 1 5 1, 6 10 2. 1 2. test
2. 1
, . .
5, . ,
.


3 Singular Value Decomposition


Singular Value Decomposition (SVD),
(toy example) .

3.1
.
, .
,
. . ,
. SVD,
.

() .
a, on, in, and,
is, are , (
,
). .
, ,
.
. ,
, , .
stop words, -


. . SVD
,
. ,
,
.
. . .
, .

3.2 SVD
Singular Value Decomposition (SVD),
, . ,
, . ,
, .
, ,

. , SVD .
, .
2.
( ).

.
, .


2:

, ,
3.
. ,
,
. , .

3:


SVD,
.

3.2.1 SVD
The SVD factors any m×n matrix A into three matrices,

A = U S Vᵀ

where U is an m×m orthogonal matrix whose columns (the left singular vectors) are the orthonormalized eigenvectors of AAᵀ, V is an n×n orthogonal matrix whose columns (the right singular vectors) are the orthonormalized eigenvectors of AᵀA, and S is an m×n diagonal matrix whose non-zero entries, the singular values, are the square roots of the eigenvalues shared by AAᵀ and AᵀA, arranged in decreasing order. The computation of the SVD therefore amounts to solving these two eigenproblems. Consider the following example matrix:


To find U we first form AAᵀ and solve its characteristic equation:

det(AAᵀ - λI) = 0

This yields the eigenvalues λ = 10 and λ = 12. Substituting λ = 10 into (AAᵀ - λI)x = 0 gives the eigenvector [1, -1]; substituting λ = 12 gives the eigenvector [1, 1].
These eigenvectors are normalized to unit length and placed as columns, ordered by decreasing eigenvalue (λ = 12 before λ = 10); this gives the matrix U. The matrix V is obtained in the same way from AᵀA, whose eigenvalues are λ = 12, λ = 10 and λ = 0. Solving (AᵀA - λI)x = 0 for each eigenvalue gives the corresponding eigenvector, and applying the Gram-Schmidt procedure to these vectors produces an orthonormal set:

V:

Finally, S is the diagonal matrix whose entries are the singular values: the square roots of the non-zero eigenvalues (here √12 and √10), in decreasing order, with the columns of U and V arranged to match. S thus records the importance of each dimension of A, while U and V record the corresponding directions in its row and column spaces.
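The computation above can be checked numerically. The sketch below assumes the 2×3 matrix from Kirk Baker's tutorial, which is consistent with the eigenvalues 10 and 12 derived in the text; since AAᵀ is 2×2, its eigenvalues follow from the quadratic formula:

```python
import math

# Example matrix (assumed, from Baker's SVD tutorial).
A = [[3, 1, 1],
     [-1, 3, 1]]

# Compute the 2x2 matrix AA^T.
AAt = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(2)]
       for i in range(2)]

# Eigenvalues of [[a, b], [c, d]] solve x^2 - (a+d)x + (ad - bc) = 0.
a, b = AAt[0]
c, d = AAt[1]
trace, det = a + d, a * d - b * c
disc = math.sqrt(trace * trace - 4 * det)
eigenvalues = sorted([(trace + disc) / 2, (trace - disc) / 2], reverse=True)
singular_values = [math.sqrt(ev) for ev in eigenvalues]
# eigenvalues -> [12.0, 10.0]; singular values -> [sqrt(12), sqrt(10)]
```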

3.2.2
SVD Latent Semantic Indexing
(LSI) Latent Semantic Analysis. SVD ,
, .

. ,


. SVD , . ,
.
, .
.

, SVD A
:

U V S . U

(), .


- .
U.
, ,
. U , . .
.
V .

Gram-Schmidt V:

S .
. , S
:

, U


A .

, , .
This presentation of the SVD follows the Singular Value Decomposition Tutorial by Kirk Baker (2005).

3.3 Toy Example


, SVD . .
( 8).
. : , . (1-3) , (46) (7-9)
. ( ) (110). , . (4-6) E4, E5
E6. , 7, 8 9 7, 8 9. 10 -


. training
set.
8:

10

SVD 8
U, S V ( V,


):

.
. ,

) S.


S . S V
:

, test,
:


test set.
.
SVD
training set. , , ( test_new) :

, training set
-

. test_new U

test 1, 2,
. . 9 test 1 9.
Table 9: Similarities between the test document and the nine training documents

Training doc:    1      2      3      4      5      6      7      8      9
Similarity:    -0.14   0.53  -0.33   0.57  -0.32  -0.38   0.02  -0.13   0.03
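The projection of the new document into the reduced training space, described above, is the standard LSI fold-in. Assuming the truncated factors U_k and S_k of the training matrix, a new document vector d is mapped into the reduced space as

```latex
\hat{d} = d^{\top} U_k S_k^{-1}
```

after which it can be compared, via cosine similarity, with the rows of U_k that represent the training documents.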

4 . , test . .
.
5. , , 5
, .
, 2, 4 9
.
,
, 4.
. , . 80% . S. . ,
. ,
( ) . .
51.85.
, 0.8:

S . 41.48. 41.19, 45.04 41.48. , ,
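The 80% rule just described can be sketched as follows. The singular values below are hypothetical, but the selection rule (keep the smallest number of leading singular values whose sum reaches the threshold, e.g. 41.48 out of 51.85 in the text) is the one used above:

```python
# Keep the smallest k such that the first k singular values account for at
# least the requested fraction (here 80%) of their total sum.
def choose_rank(singular_values, fraction=0.8):
    total = sum(singular_values)
    cumulative = 0.0
    for k, sigma in enumerate(singular_values, start=1):
        cumulative += sigma
        if cumulative >= fraction * total:
            return k
    return len(singular_values)

# Hypothetical singular values, in decreasing order.
print(choose_rank([5.0, 3.0, 1.0, 0.5]))  # -> 2  (5 + 3 = 8 >= 0.8 * 9.5)
```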


U, S V :


U test_new
:

test_new ,
. test
( test_new) training set
( 10).
. 1 test. ,
, 1, 2 3,
.
SVD 5


.
.
Table 10: Similarities between the test document and the nine training documents in the reduced space

Training doc:    1      2      3      4      5      6      7      8      9
Similarity:     0.97   0.87   0.69   0.37  -0.8    0.26  -0.67   0.2   -0.16


4 Reuters
SVD,
, Reuters-21578
.
.

4.1
.
.
, .
easy.txt, medium.txt
hard.txt.
. . , easy.txt , medium.txt hard.txt
( ).
,
.
. , . , .
, 4.


. , 100, 300
, 100 , .

4:

5 (
, ). 5.

5:

,
. .
, 10.
(
6).


6:

Noise is controlled by a parameter x: each word of a document is replaced with probability 1/x. For example, with x = 5 the probability is 1/5 = 0.2, so on average 20% of the words of each document are replaced.
, easy.txt,
medium.txt hard.txt. .
, easy.txt
hard.txt, easy.txt medium.txt.
, ,
50% 50% . ,
,
50%
50%. 4
25%. , ok

Enter ( 7). 15
. 1 15,
( 8).


7: ok

8: 15

(1-5) ,
(6-10) (11-15). , -


E ( easy),
M ( medium) H (
hard). , Flesh
, . , 1/10
(10%) .
10 .
1/15 (6.67%) 15
. , 1/20 (5%), 20
. , , .
, , easy.txt, medium.txt hard.txt . easy.txt easy1, easy2 easy3,
medium.txt medium1, medium2 medium3, hard.txt hard1, hard2 hard3. ,
. , , easy1,
easy2 easy3 . (9, 10
11) ,
.
9 .
E . ,
( 25%). , .

9:

10 . M
. .
,
.

10:

11. . , ,
.

11:

Finally, common stop words such as a, the, on, in, is, are and and are removed from all documents.


easy.txt,
medium.txt hard.txt.

4.2 Reuters

. .
Reuters-21578.

4.2.1
Reuters-21578 1987
Reuters .

Reuters Ltd. (Sam Dobbins, Mike Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen, Monica Cellio, Phil Hayes, Laura Knecht, Irene Nirenburg)


1987. 1990, Reuters CGI ( W.


Bruce Croft) Amherst.
1990 David D. Lewis Stephen
Harding .
1991 1992 David D. Lewis Peter Shoemaker
.
1993 Reuters-22173, Distribution 1.0. 1993 1996, Distribution 1.0
FTP
Amherst. 1996 ACM SIGIR,
Reuters-22173
. , . . Steve Finch David D. Lewis
1996, Finch ,
into SGML. During this cleanup, 595 documents that were exact duplicates were removed, leaving 21578 documents, and the collection was renamed Reuters-21578. The main differences between Reuters-22173 and Reuters-21578 are:

(tags) SGML
SGML , (
) .


. -


, ,
.

(ID), ,
1000 ID (
).

4.2.2
The Reuters-21578 collection is distributed in 22 files. Each of the first 21 (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, and the last (reut2-021.sgm) contains 578 documents. The files are in SGML format.
:

reuters: :
o topics: yes, no bypass, , .
o lewissplit: training, test not-used. training training set David D. Lewis Representation and learning
in information retrieval, An evaluation of phrasal and clustered representations on a text categorization task, Feature selection and feature extraction for text categorization and A comparison of two learning algorithms for text categorization. Documents marked test were in the test set of those experiments, and not-used documents in neither.
.
o cgisplit: training-set published-testset training set test set
Philip J. Hayes A shell for content-based
text categorization A system for content-based indexing of a database of news stories.
o oldid: (ID) Reuters22173.
o newid: (ID) Reuters21578.


date: .

mknote: Steve Finch


.

topics: topics (
).

places: topics places .

people: topics people .

orgs: topics orgs .

exchange: topics exchange .

companies: .

unknown:
.

text: :
o author: .
o dateline: , .
o title: .
o body: .

4.2.3
Reuters-21578 : exchanges, orgs,
people, places topics.
. ,
.
Table 11: Category sets of the Reuters-21578 collection

Category set   Number of categories
Exchanges      39
Orgs           56
People         267
Places         175
Topics         135

11 . topics
. coconut, gold,
inventories money-supply".
Reuters.
exchanges, orgs, people places
. nastag (exchanges), gatt (orgs), perez-de-cuellar (people) australia (places). ,
,
(
topics). , , . . ,
.


5
SVD
Flesh, SVD-Aggregation .
,
, Reuters.
Four methods are compared:

Cosine: plain cosine similarity on the term-frequency vectors

Flesh: classification by the Flesh Reading Ease score

SVD-Cos: cosine similarity in the reduced SVD space of Chapter 3

Agg-SVD: the Aggregated SVD method (Section 2.3)

5.1

. :

a: the number of documents correctly assigned to the category (true positives)

b: the number of documents incorrectly assigned to the category (false positives)

c: the number of documents incorrectly rejected from the category (false negatives)

d: the number of documents correctly rejected from the category (true negatives)

From these counts the measures Recall, Precision, Fallout, Accuracy and Error are defined:

Recall = a/(a+c), for a+c > 0

Precision = a/(a+b), for a+b > 0

Fallout = b/(b+d), for b+d > 0

Accuracy = (a+d)/n, where n = a+b+c+d > 0

Error = (b+c)/n, where n = a+b+c+d > 0
Accuracy Error .
Finally, the F-measure combines Recall and Precision into a single score:

F = (2 × Precision × Recall) / (Precision + Recall)
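The measures defined above can be computed directly from the four counts:

```python
# Evaluation measures from a (true positives), b (false positives),
# c (false negatives) and d (true negatives).
def evaluate(a, b, c, d):
    n = a + b + c + d
    recall = a / (a + c)
    precision = a / (a + b)
    fallout = b / (b + d)
    accuracy = (a + d) / n
    error = (b + c) / n
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, fallout, accuracy, error, f_measure

# Example: 8 correct assignments, 2 false alarms, 2 misses, 8 correct rejections.
print(evaluate(8, 2, 2, 8))  # recall, precision, accuracy and F all equal 0.8
```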

5.2 0%
. .

4 .

100% F, .
Flesh
. , , ( , ).
12,
heat-map Aggregated
SVD ( )
Agg-SVD 30%.
.
100%. o
heat-map -


training set ( training set).

12: Heat-map Aggregated SVD

data set .

5.3 25%
25%.
. 12 .
100%,
SVD (SVD-Cos) 70%, 50% 30%, Aggregated
SVD 70% 30%. .

Table 12:

Method          Recall   Precision   Fallout   Accuracy   Error   F-measure
Cosine            100%        100%        0%       100%      0%        100%
Flesh              42%        100%        0%        80%     19%         60%
SVD-Cos 100%       29%         80%        3%        73%     26%         42%
Agg-SVD 100%       79%         42%       53%        57%     42%         55%
SVD-Cos 70%       100%        100%        0%       100%      0%        100%
Agg-SVD 70%       100%         60%       32%        79%     21%         76%
SVD-Cos 50%       100%         93%        3%        98%      2%         97%
Agg-SVD 50%        86%         92%        3%        93%      7%         89%
SVD-Cos 30%       100%        100%        0%       100%      0%        100%
Agg-SVD 30%       100%        100%        0%       100%      0%        100%

Table 13:

Method          Recall   Precision   Fallout   Accuracy   Error   F-measure
Cosine            100%        100%        0%       100%      0%        100%
Flesh              64%         50%       32%        66%     33%         56%
SVD-Cos 100%       93%         65%       25%        80%     19%         76%
Agg-SVD 100%       36%         83%       35%        76%     23%         50%
SVD-Cos 70%       100%        100%        0%       100%      0%        100%
Agg-SVD 70%        57%         89%        3%        83%     16%         70%
SVD-Cos 50%        93%        100%        0%        98%      2%         96%
Agg-SVD 50%        93%         87%        7%        93%      7%         90%
SVD-Cos 30%       100%        100%        0%       100%      0%        100%
Agg-SVD 30%       100%        100%        0%       100%      0%        100%

SVD-Cos 100% Agg-SVD 100% , Agg-SVD 50%


. Flesh .

, 16 17,
.

. , Agg-SVD 30%, SVD-Cos 70%
SVD-Cos 30% .
Table 14:

Method          Recall   Precision   Fallout   Accuracy   Error   F-measure
Cosine            100%        100%        0%       100%      0%        100%
Flesh              92%         72%       17%        85%     14%         81%
SVD-Cos 100%       93%         76%       14%        88%     11%         84%
Agg-SVD 100%       57%         80%        7%        80%     19%         67%
SVD-Cos 70%       100%        100%        0%       100%      0%        100%
Agg-SVD 70%        71%        100%        0%        90%      9%         83%
SVD-Cos 50%       100%        100%        0%       100%      0%        100%
Agg-SVD 50%       100%        100%        0%       100%      0%        100%
SVD-Cos 30%       100%        100%        0%       100%      0%        100%
Agg-SVD 30%       100%        100%        0%       100%      0%        100%

13 F
. SVDCos . , , Agg-SVD,
.


13: F-measure

15, 14
SVD-Cos, SVD-Cos 50% .
SVD-Cos 30% . .

14: F-measure SVD-Cos


15 .

15: F-measure Agg-SVD

, 16 heat-map Aggregated SVD.


.

16: Heat-map Aggregated SVD


5.4 50%

50% . 15 . SVD-Cos 30%. Agg-SVD 30% Cosine.
SVD , .
Flesh .
.
.
, , , .
Flesh
() .

Table 15:

Method          Recall   Precision   Fallout   Accuracy   Error   F-measure
Cosine             92%         86%        7%        92%      7%         89%
Flesh               5%          3%       81%        15%     85%          2%
SVD-Cos 100%       71%         42%       50%        57%     42%         53%
Agg-SVD 100%       36%         33%       35%        55%     45%         34%
SVD-Cos 70%        99%         74%       17%        88%     11%         85%
Agg-SVD 70%        93%         52%       42%        69%     30%         67%
SVD-Cos 50%        99%         70%       21%        86%     14%         82%
Agg-SVD 50%        93%         59%       32%        76%     23%         72%
SVD-Cos 30%        99%         99%        0%        99%      0%         99%
Agg-SVD 30%        99%         88%        7%        95%      4%         93%


Table 16:

Method          Recall   Precision   Fallout   Accuracy   Error   F-measure
Cosine             78%         78%       10%        85%     14%         78%
Flesh              92%         46%       53%        61%     38%         61%
SVD-Cos 100%       43%         55%       17%        69%     30%         48%
Agg-SVD 100%       29%         33%       28%        57%     42%         30%
SVD-Cos 70%        86%         99%        0%        95%      4%         92%
Agg-SVD 70%        38%         71%       71%        74%     26%         48%
SVD-Cos 50%        71%         90%        3%        88%     11%         80%
Agg-SVD 50%        43%         55%       17%        69%     30%         48%
SVD-Cos 30%        99%         78%       14%        90%      9%         88%
Agg-SVD 30%        86%         75%       14%        86%     14%         80%

16 .

. SVD-Cos, Agg-SVD,
Cosine Flesh. SVD-Cos. 70%
30% . Agg-SVD 30% 50%
( F) 70% 100%.
. Flesh
. Flesh, F-measure .
, ,
.
Recall,
, Precision, . ,
. , , -


.
, . ,
, ,
.
Table 17:

Method          Recall   Precision   Fallout   Accuracy   Error   F-measure
Cosine             85%         92%       35%        92%      7%         88%
Flesh              85%         85%        7%        90%     10%         85%
SVD-Cos 100%       36%         71%       71%        73%     26%         48%
Agg-SVD 100%       29%         27%       39%        50%     50%         28%
SVD-Cos 70%        79%         99%        0%        93%      7%         88%
Agg-SVD 70%        64%         90%        3%        86%     14%         75%
SVD-Cos 50%        71%         90%        3%        88%     11%         80%
Agg-SVD 50%        57%         89%        3%        83%     16%         70%
SVD-Cos 30%        71%         99%        0%        90%      9%         83%
Agg-SVD 30%        57%         80%        7%        80%     19%         67%

17 .
.
Flesh . Recall ,
Precision . , ,
70% , 30% .
17 F
.
SVD-Cos. Agg-SVD, SVD-Cos
.

SVD-Cos, Flesh
.

17: F-measure

18: F-measure SVD-Cos

18 F SVD-Cos.
30% , 70%,
30% . -


, 30% 70% .
19 F Agg-SVD . 100% . 70% 50%
. 30%
.
.

19: F-measure Agg-SVD

20: Recall-Precision


20, 21 22 Recall-Precision .
.
SVD-Cos Agg-SVD . Flesh
. ,
. Flesh Agg-SVD, .

21: Recall-Precision

22: Recall-Precision

23 Recall Precision .

SVD-Cos, Agg-SVD Cosine,


. Flesh .

23: Recall-Precision

, heat-map
Agg-SVD ( 24). heat-map , .

24: Heat-map Aggregated SVD


5.5 Reuters

. , Reuters 21578 . , , . Flesh,
.
, ,
. , , coffee, gold ship.
Topic
. coffee
, gold
ship .
18
coffee. SVD-Cos 50%,
30% . Agg-SVD
, 30% . ,
.
Table 18: coffee

Method          Recall   Precision   Fallout   Accuracy   Error   F-measure
Cosine             95%         60%       48%        72%     27%         75%
SVD-Cos 100%       58%         94%        2%        80%     19%         71%
Agg-SVD 100%       85%         46%       74%        50%     49%         59%
SVD-Cos 70%        88%         92%        5%        92%      8%         90%
Agg-SVD 70%        69%         60%       34%        67%     32%         64%
SVD-Cos 50%        92%         98%        1%        97%      3%         96%
Agg-SVD 50%        77%         71%       22%        77%     22%         74%
SVD-Cos 30%        92%         96%        2%        95%      5%         94%
Agg-SVD 30%        69%         90%        5%        83%     16%         78%

gold ( 19). Cosine SVD-Cos


50% . Agg-SVD 70% .
Table 19: gold

Method          Recall   Precision   Fallout   Accuracy   Error   F-measure
Cosine             64%         84%        4%        86%     13%         73%
SVD-Cos 100%       97%         43%       52%        62%     37%         60%
Agg-SVD 100%       47%         62%       11%        77%     22%         53%
SVD-Cos 70%        88%         83%        6%        92%      8%         86%
Agg-SVD 70%        47%         80%       45%        82%     18%         59%
SVD-Cos 50%        94%         89%       45%        95%      5%         91%
Agg-SVD 50%        65%         55%       20%        75%     24%         59%
SVD-Cos 30%        88%         79%        9%        90%     10%         83%
Agg-SVD 30%        88%         68%       15%        85%     14%         77%

Table 20: ship

Method          Recall   Precision   Fallout   Accuracy   Error   F-measure
Cosine             22%         80%        2%        75%     24%         34%
SVD-Cos 100%       28%         97%        1%        79%     21%         43%
Agg-SVD 100%       25%         81%        2%        70%     29%         36%
SVD-Cos 70%        89%         89%       46%        93%      7%         89%
Agg-SVD 70%        78%         67%       16%        82%     18%         72%
SVD-Cos 50%        94%         89%        4%        95%      5%         92%
Agg-SVD 50%        44%         62%       11%        75%     24%         52%
SVD-Cos 30%        78%         82%        7%        89%     11%         80%
Agg-SVD 30%        67%         63%       16%        77%     21%         65%

20 ship
19 Cosine .

25: F-measure

26: F-measure SVD-Cos

25 F-measure
.

SVD-Cos, 70%, 50% 30%.


26 F SVD-Cos.
100% .
, 50%
.
Agg-SVD 27.
. .
,
gold ship. coffee
70% .

27: F-measure Agg-SVD

28, 29 30 Recall-Precision,
. SVD-Cos
. ,

Agg-SVD Cosine,

(Precision) Agg-SVD
Recall. SVD
, . 30%
Agg-SVD 50% SVD-Cos.


28: Recall-Precision coffee

29: Recall-Precision gold

,
31. Recall-Precision
, SVD-Cos
.
Agg-SVD.


30: Recall-Precision ship

31: Recall-Precision coffee, gold ship


6

. ,
Singular Value Decomposition (SVD) . ,
Flesh Reading Ease.
,
, . ,
Flesh Reading Ease,
Aggregated SVD. , Flesh Reading Ease . , Aggregated
SVD, SVD Aggregation
SVD, ,
.
, ,
SVD . , Reuters-21578.
: .
,
: , . , ,
0%, 25% 50% . ,

Reuters, ,
.
SVD , .
SVD
. ,
. ,
,
.



References

Baeza-Yates, R., Ribeiro-Neto, B. (1999) Modern Information Retrieval, Addison Wesley Longman Inc.

Baker, K. (2005) Singular Value Decomposition Tutorial, http://www.cs.wits.ac.za/~michael/SVDTut.pdf

Balabanovic, M., Shoham, Y. (1997) Fab: Content-based, collaborative recommendation, ACM Communications, volume 40, number 3, pages 66-72

Bharat, K., Henzinger, M. R. (1998) Improved Algorithms for Topic Distillation in a Hyperlinked Environment, Proc. ACM Conf. Res. and Development in Information Retrieval

Chakrabarti, S., Dom, B., Indyk, P. (1998) Enhanced Hypertext Categorization Using Hyperlinks, Proc. ACM SIGMOD International Conference on Management of Data, volume 27, number 2, pages 307-318

Edelstein, H. (1999) Introduction to Data Mining and Knowledge Discovery, 3rd ed., Two Crows Corporation

Furnas, G., Deerwester, S., et al. (1988) Information retrieval using a singular value decomposition model of latent semantic structure, SIGIR, pages 465-480

Guan, H., Zhou, J., Guo, M. (2009) A Class-Feature-Centroid Classifier for Text Categorization, WWW 2009, Madrid, pages 201-210

Han, J., Kamber, M. (2001) Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA

Hans-Henning, G., Spiliopoulou, M., Nanopoulos, A. (2010) Eigenvector-Based Clustering Using Aggregated Similarity Matrices, ACM SAC 10, pages 1083-1087

Herlocker, J., Konstan, J., Terveen, L., Riedl, J. (2004) Evaluating Collaborative Filtering Recommender Systems, ACM Trans. on Information Systems, volume 22, number 1, pages 5-53

Joachims, T. (1998) Text categorization with support vector machines: Learning with many relevant features, Springer Verlag, pages 137-142

Kim, H., Howland, P., Park, H. (2006) Dimension Reduction in Text Classification with Support Vector Machines, Journal of Machine Learning Research, volume 6, number 1, pages 37-53

Lewis, D. D. (1991) Evaluating text categorization, HLT 91, Proceedings of the Workshop on Speech and Natural Language, pages 312-318

Melville, P., Mooney, R. J., Nagarajan, R. (2002) Content-Boosted Collaborative Filtering for Improved Recommendations, AAAI, pages 187-192

Montanes, E., Diaz, I., Ranilla, J., Combarro, E. F., Fernandez, J. (2005) Scoring and selecting terms for text categorization, IEEE Intelligent Systems, volume 20, number 3, pages 40-47

Papadimitriou, C. H., Raghavan, P., Tamaki, H., Vempala, S. (1998) Latent Semantic Indexing: a Probabilistic Analysis, Proc. ACM Symposium on Principles of Database Systems, volume 61, number 2, pages 217-235

Pei, Z. L., Shi, X. H., Marchese, M., Liang, Y. C. (2007) An enhanced text categorization method based on improved text frequency approach and mutual information algorithm, Progress in Natural Science, volume 17, number 12, pages 1494-1500

Robertson, S. E., Sparck Jones, K. (1976) Relevance weighting of search terms, Journal of the American Society for Information Science, volume 27, number 3, pages 129-146

Sarwar, B., Karypis, G., Konstan, J., Riedl, J. (2000) Application of dimensionality reduction in recommender systems: a case study, ACM WebKDD Workshop, volume 1625, number 1, pages 285-295

Salton, G., Buckley, C. (1988) Term-weighting approaches in automatic text retrieval, Information Processing & Management, volume 24, number 5, pages 513-523

Soucy, P., Mineau, G. W. (2003) Feature selection strategies for text categorization, Advances in Artificial Intelligence, Proceedings, 2671, pages 505-509

Symeonidis, P. (2007) Content-based Dimensionality Reduction for Recommender Systems, GfKl 2007, pages 619-626

Yang, Y., Pedersen, J. (1997) A comparative study on feature selection in text categorization, International Conference on Machine Learning (ICML), pages 412-420

Yang, Y. (1997) An evaluation of statistical approaches to text categorization, Journal of Information Retrieval, volume 1, number 1-2, pages 69-90

Yang, B., Sun, J. T., Wang, T., Chen, Z. (2009) Effective Multi-Label Active Learning for Text Classification, ACM KDD 09, pages 917-926


1:

The Flesh Reading Ease implementation used is the open-source Flesh tool from http://flesh.sourceforge.net (included in the Flesh folder of the accompanying cd). The programs were developed with Microsoft Visual Studio 2008 and Matlab R2010a under Windows 7.

1.1 Generator
. , Generator cd
. MS Visual Studio ( 2008 )
FileOpenProject/Solution.
Generator .
Generator.sln . Open, Generator .
F5.
,
.
. 90 , 30 ,
150 50% ( 32).
training set. /Generator/Generator/
( easy.txt, medium.txt hard.txt
). 90 (
). (..
) (.. training_set)

90 . training
set .

32: 90 training set

test set ( F5) , , 30 - 10 - ( 33). ,


/Generator/Generator/
(.. test_set).
test set.

33: 30 test set

, training set
test set. O . 1.3 .

1.2 Reuters

, Reuters.
, ,
. , (.. ) training_set test_set.
, Reuters 22
(reut2-000.sgm reut2-021.sgm), 21 578 . Reuters21578. (..
Notepad++) 22
sgml tags. TOPIC .

tag .
.
( .. ,
coffee, gold ship).
earn .
22 Find earn. :
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5757" NEWID="214">
<DATE>26-FEB-1987 19:16:39.70</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
&#5;&#5;&#5;F
&#22;&#22;&#1;f0395&#31;reute
u f BC-AVERY-&lt;AVY>-SETS-TWO 02-26 0086</UNKNOWN>
<TEXT>&#2;
<TITLE>AVERY &lt;AVY> SETS TWO FOR ONE STOCK SPLIT</TITLE>
<DATELINE>

PASADENA, Calif., Feb 26 - </DATELINE><BODY>Avery said its

board authorizerd
a two for one stock split, an increased in the quarterly
dividend and plans to offer four mln shares of common stock.
The company said the stock split is effective March 16 with
a distribution of one additional share to each shareholder of
record March 9.
It said the quarterly cash dividend of 10.5 cts per share


on the split shares, a 10.5 pct increase from the 19 cts per
share before the split.
Avery said it will register with the Securities and
Exchange Commission shrortly to offer four mln additional
common shares. It will use the proceeds to repay debt, finance
recent acquisitions and for other corporate purposes.
Reuter
&#3;</BODY></TEXT>
</REUTERS>
tag TOPIC (earn). tag BODY
. , tag BODY,
, txt
. training_set . ,
Reuters , earn (
, 22 Reuters).
, , txt, training_set.

. txt .
, txt
E
. , E
Reuters.
.
,
TOPIC .
. E
M H. training_set
txt .


test_set ,
TOPIC training set. , , earn
E, M H . ,
, training set , training_set test_set.
training set set ,

sgml tags.
cd Reuters training set test set
coffee, gold ship,
.

1.3
, Diplomatiki
cd .
MS Visual Studio ( 2008 )
FileOpenProject/Solution.
Diplomatiki .
Diplomatiki.sln . Open, Diplomatiki .
, Diplomatiki.cpp Solution Explorer MS
Visual Studio. ,
training set test set. 49 50 () training set test set ,
34:

34: training test set


F5 .
Matlab (
R2010a). , Matlab
. ,
,
35:

35:

284, training
set 90 test set 30.
SVD-Cos
Agg-SVD. 0 1.
30% ,
0.3 Enter . ,
, U SVD
( 36). 8
30% .

36:

, .
Easy , Medium
Hard . Reuters
Easy, Medium Hard E, M H txt, .


37:

38: SVD


39: Aggregated SVD

40: Flesh Reading Ease

