SVD Diplomatiki
2011
Ivaylo Kehayov
2011
Contents

1 Introduction
  1.1 Text mining
    1.1.1 Applications of text mining
    1.1.2 Text classification
    1.1.3 Representation of documents
  1.2 Outline of the thesis
2 Related work
  2.1 Flesh
  2.2 Cosine similarity
  2.3 SVD Aggregation
3 Singular Value Decomposition
  3.1 Preliminaries
  3.2 The SVD method
    3.2.1 Computing the SVD
    3.2.2 Latent Semantic Indexing
  3.3 Toy example
4 Document generation and the Reuters collection
  4.1 The document generator
  4.2 Reuters
    4.2.1 History
    4.2.2 Format
    4.2.3 Categories
5 Experimental results
  5.1 Evaluation measures
  5.2 Results with 0% word mixing
  5.3 Results with 25% word mixing
  5.4 Results with 50% word mixing
  5.5 Results on the Reuters collection
6 Conclusions
References
Appendix 1: Running the programs
  1.1 Generator
  1.2 Reuters
  1.3 Running the experiments
1
, 1994 ( 1998) ,
.
.
Flesh Reading Ease . ,
,
.
Flesh Reading Ease .
, (data set) .
1.1
, .
,
, ,
, .
.
, , .
1.1.1
(text mining).
, . :
:
. , .
:
PubGene .
: , IBM Microsoft,
.
: ,
Tribune Company, ,
.
: .
: ,
, .
:
.
, (.. ) . ,
Nature,
.
,
,
. , ,
, . ,
spam e-mail .
1.1.2
.
.
,
.
(text classification categorization) (document classification categorization) , , , . :
(supervised classification), ( ) .
(unsupervised classification), .
- (semi-supervised classification), .
1.1.3
, . , , , . (term frequency vector).
, , ,
(dictionary).
( )
training set.
. ,
. , . , automobile
. . , automobile , 7,
, 0 .
. ,
. , , :
1: The Sun is a star.
2: The Earth is a planet.
3: The Earth is smaller than the Sun.
( , ):
= [a, Earth, is, planet, smaller, star, Sun, than, the]
, 9.
:
1 = [1, 0, 1, 0, 0, 1, 1, 0, 1]
2 = [1, 1, 1, 1, 0, 0, 0, 0, 1]
3 = [0, 1, 1, 0, 1, 0, 1, 1, 2]
, 0 1, 2
the 3.
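The construction of term-frequency vectors described above can be sketched in Python (a minimal illustration; tokenization is simplified to lowercasing and stripping the final period):

```python
# Build term-frequency vectors for the three example sentences.
docs = [
    "The Sun is a star.",
    "The Earth is a planet.",
    "The Earth is smaller than the Sun.",
]

def tokenize(text):
    return [w.strip(".").lower() for w in text.split()]

# The dictionary: every distinct word of the collection, in alphabetical order.
dictionary = sorted({w for d in docs for w in tokenize(d)})
# -> ['a', 'earth', 'is', 'planet', 'smaller', 'star', 'sun', 'than', 'the']

def tf_vector(text):
    tokens = tokenize(text)
    return [tokens.count(term) for term in dictionary]

vectors = [tf_vector(d) for d in docs]
# vectors[0] == [1, 0, 1, 0, 0, 1, 1, 0, 1]
# vectors[2] == [0, 1, 1, 0, 1, 0, 1, 1, 2]   # "the" occurs twice
```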
. , a, is the. .
, .
stop words,
.
,
.
, .
( 2) .
1.2
: (related work),
SVD. SVD,
.
, Reuters-21578.
. .
,
.
2
.
,
SVD-Aggregation.
2.1 Flesh
The Flesh readability measures estimate how easy an English text is to read. Two variants exist: the Flesh Reading Ease score and the Flesh-Kincaid Grade Level.
,
, , . ,
Flesh Reading Ease
Flesh-Kincaid Grade Level.
Rudolf Flesh ( J. Peter Kincaid),
.
Flesh-Kincaid Grade Level 1975
J. Peter Kincaid Rudolf
Flesh. Flesh Reading Ease 1978
.
Flesh Reading Ease.
.
The Flesh Reading Ease score rates a text on how easy it is to read. It is computed from the average sentence length (words per sentence) and the average word length in syllables:

FRE = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words)

The theoretical maximum of the formula is about 120, reached only by degenerate text in which every sentence is a single one-syllable word; in practice scores fall between 0 and 100, with higher scores indicating easier text.
The standard interpretation of the Flesh Reading Ease score:

Score    Difficulty
90-100   Very easy
80-89    Easy
70-79    Fairly easy
60-69    Standard
50-59    Fairly difficult
30-49    Difficult
0-29     Very difficult

The three difficulty classes used in this work:

Score    Class
90-100   Easy
60-70    Medium
0-30     Hard
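The score can be computed with a short sketch. The standard formula is used; the syllable counter below is a rough vowel-group heuristic, not the exact counter of the Flesh tool:

```python
import re

# Flesh Reading Ease, computed with the standard formula.
def syllables(word):
    # Rough heuristic: count groups of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesh_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syl = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) \
                   - 84.6 * (syl / len(words))

score = flesh_reading_ease("The Sun is a star. The Earth is a planet.")
# Short words and short sentences give a high (easy) score, ~108.7 here.
```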
2.2
. , .
-1 1. 0, 1, .
90 0 ,
.
180 -1.
, . ( ) ( ).
,
0 1, .
90.
.
For two documents represented by term-frequency vectors A and B, the cosine similarity is

cos(A, B) = (A · B) / (||A|| ||B||)

and equals 1 when the two vectors point in exactly the same direction.
1:
,
:
1: The Sun is a star.
2: The Earth is a planet.
3: The Earth is smaller than the Sun.
:
= [a, Earth, is, planet, smaller, star, Sun, than, the]
:
1 = [1, 0, 1, 0, 0, 1, 1, 0, 1]
2 = [1, 1, 1, 1, 0, 0, 0, 0, 1]
3 = [0, 1, 1, 0, 1, 0, 1, 1, 2]
, :
4: The Sun is not a planet.
:
4 = [1, 0, 1, 1, 0, 0, 1, 0, 1]
training set test set.
.
. , , . ,
,
. :
Computing the cosine similarity of document 4 against the three training documents gives:

cos(doc1, doc4) = 0.8
cos(doc2, doc4) = 0.8
cos(doc3, doc4) = 0.596

so document 4 is most similar to documents 1 and 2, and less similar to document 3.
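The comparison can be verified with a short sketch (pure Python; the vectors are those of the example):

```python
import math

# Cosine similarity between term-frequency vectors, as defined above.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [1, 0, 1, 0, 0, 1, 1, 0, 1]   # "The Sun is a star."
d2 = [1, 1, 1, 1, 0, 0, 0, 0, 1]   # "The Earth is a planet."
d3 = [0, 1, 1, 0, 1, 0, 1, 1, 2]   # "The Earth is smaller than the Sun."
d4 = [1, 0, 1, 1, 0, 0, 1, 0, 1]   # "The Sun is not a planet." ("not" is outside the dictionary)

sims = [cosine(d, d4) for d in (d1, d2, d3)]
# d4 is equally close to d1 and d2 (0.8 each) and further from d3 (~0.596)
```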
: test set
training set ( ). test train .
training . ,
training 1, test . ,
1, 2 3,
. test set.
1
2
training set
.
SVD
U .
. ( )
.
i
(1im) , (
) U .
m -
. 4 ( 0 ).
4:
1
1
x ,
.
, (aggregated matrix)
D . , , ,
. . M D. ,
M
D, M D
. , M (
D ) () U.
test set training set. test m:
V S S ( ).
S :
c, c S. , doc_new
U. U
training set. .
- , M
( )
.
.
.
, . 5
. , . ,
1, 2.
SVD ,
,
3).
5:
10
11
12
10
, 75%
. 25% . S , ( ) . , . 75%
.
S U V.
U, S
, . 1010 . :
. -
. ,
The resulting 10x10 matrix M (the matrix is symmetric; the off-diagonal entries are shown):

Table 6:

        E1     E2     E3     E4     E5     E6     E7     E8     E9    E10
E1           -0.79  -0.14  -0.60   0.21  -1.88   0.36  -0.46   0.30  -0.71
E2   -0.79          -1.86   0.81  -0.35  -1.20   0.81  -0.32   1.52   0.01
E3   -0.14  -1.86           1.45   0.27  -0.65   0.84   0.20   2.07   0.65
E4   -0.60   0.81   1.45           0.34  -0.27   0.89  -0.33   0.12  -0.95
E5    0.21  -0.35   0.27   0.34           0.31   1.36  -0.45   1.50  -0.39
E6   -1.88  -1.20  -0.65  -0.27   0.31           0.23  -0.36   0.21  -0.69
E7    0.36   0.81   0.84   0.89   1.36   0.23          -0.69   0.23  -0.09
E8   -0.46  -0.32   0.20  -0.33  -0.45  -0.36  -0.69           0.47  -1.62
E9    0.30   1.52   2.07   0.12   1.50   0.21   0.23   0.47          -0.36
E10  -0.71   0.01   0.65  -0.95  -0.39  -0.69  -0.09  -1.62  -0.36
.
. , , ,
.
, test,
:
,
. , :
test .
training U. , -
U.
.
:
Table 7: Cosine similarity of the test document with each training document

Document   Similarity
1          0.4667
2          0.2153
3          0.0500
4          0.6460
5          0.0165
6          0.5662
7          0.2792
8          0.3283
9          0.7906
10         0.7395

The highest similarity is with document 9.
, , . ( 0 ) 0.3091, 0.1200, 0.2141, 0.2338 -0.3612.
1, 4, 6, 7 10. 1 5 1, 6 10 2. 1 2. test
2. 1
, . .
5, . ,
.
3 Singular Value Decomposition
3.1
.
, .
,
. . ,
. SVD,
.
() .
a, on, in, and,
is, are , (
,
). .
, ,
.
. ,
, , .
stop words, -
. . SVD
,
. ,
,
.
. . .
, .
3.2 SVD
Singular Value Decomposition (SVD),
, . ,
, . ,
, .
, ,
. , SVD .
, .
2.
( ).
.
, .
2:
, ,
3.
. ,
,
. , .
3:
SVD,
.
3.2.1 SVD
The SVD factors an m×n matrix A as

A = U S V^T

where U is an m×m orthogonal matrix, S is an m×n diagonal matrix containing the singular values of A, and V is an n×n orthogonal matrix.
The columns of U are the left singular vectors of A and the columns of V are its right singular vectors; S holds the singular values of A on its diagonal, in decreasing order. The columns of U and of V are orthonormal.
SVD . :
. A :
, :
For the eigenvalue λ=12 the corresponding eigenvector is [1, 1]. Repeating the same computation for the other eigenvalue, λ=10, gives the second eigenvector.
:
,
Gram-Schmidt .
V . V
. =12
For λ=12 and λ=10 the eigenvectors are found as before; since A has rank 2, the remaining eigenvalue of A^T A is λ=0, with its own eigenvector.
Applying Gram-Schmidt orthonormalization to the eigenvectors gives the columns of V. The matrix S is formed by placing the square roots of the nonzero eigenvalues (here those of λ=12 and λ=10) on its diagonal in decreasing order, with the columns of U and V ordered accordingly.
To summarize the computation of the SVD: the eigenvectors of A A^T form the columns of U, the eigenvectors of A^T A form the columns of V, and the diagonal of S holds the square roots of their (shared) nonzero eigenvalues, i.e. the singular values of A.
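The derivation can be checked numerically. The sketch below assumes the example matrix from Baker's (2005) tutorial, A = [[3, 1, 1], [-1, 3, 1]], which is consistent with the eigenvalues λ=12 and λ=10 and the eigenvector [1, 1] appearing in the text:

```python
import math

# Eigenvalues of A·Aᵀ give the squared singular values of A.
A = [[3, 1, 1],
     [-1, 3, 1]]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# A·Aᵀ is 2×2: entry (i, j) is the dot product of rows i and j of A.
AAt = [[dot(A[i], A[j]) for j in range(2)] for i in range(2)]
# AAt == [[11, 1], [1, 11]]

# For a symmetric 2×2 matrix [[a, b], [b, a]] the eigenvalues are a+b and a-b
# (with eigenvectors [1, 1] and [1, -1]).
a, b = AAt[0][0], AAt[0][1]
lam1, lam2 = a + b, a - b            # λ = 12 and λ = 10

# The singular values are the square roots of the eigenvalues.
s1, s2 = math.sqrt(lam1), math.sqrt(lam2)   # √12 ≈ 3.464, √10 ≈ 3.162
```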
3.2.2 Latent Semantic Indexing

The best-known use of SVD on text is Latent Semantic Indexing (LSI), also known as Latent Semantic Analysis. LSI applies the SVD to the term-document matrix and keeps only the largest singular values, mapping documents into a lower-dimensional latent space in which semantically related documents lie close together.
. ,
. SVD , . ,
.
, .
.
Keeping only the k largest singular values, the truncated SVD approximates A as:

A ≈ A_k = U_k S_k V_k^T
U V S . U
(), .
- .
U.
, ,
. U , . .
.
V .
Gram-Schmidt V:
S .
. , S
:
, U
A .
, , .
Singular Value Decomposition Tutorial Kirk
Baker, 2005.
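The truncation step can be sketched as follows. The same 2×3 example matrix is assumed, and the rank-k approximation A_k = U_k S_k V_k^T is rebuilt from the leading singular values:

```python
import math

# Rank-k approximation from the SVD of the 2×3 example matrix.
A = [[3, 1, 1], [-1, 3, 1]]
m, n = 2, 3

# From the eigen-decomposition of A·Aᵀ: singular values and left vectors.
s = [math.sqrt(12), math.sqrt(10)]
U = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]   # columns are u1, u2

# Right singular vectors: v_i = Aᵀ·u_i / s_i
def right_vector(i):
    return [sum(A[r][c] * U[r][i] for r in range(m)) / s[i] for c in range(n)]

V = [right_vector(0), right_vector(1)]        # rows are v1, v2

def rank_k(k):
    # A_k = Σ_{i<k} s_i · u_i · v_iᵀ
    return [[sum(s[i] * U[r][i] * V[i][c] for i in range(k))
             for c in range(n)] for r in range(m)]

A2 = rank_k(2)   # full rank: reconstructs A exactly
A1 = rank_k(1)   # rank-1 approximation ≈ [[1, 2, 1], [1, 2, 1]]
```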
3.3 Toy example
. training
set.
8:
10
SVD 8
U, S V ( V,
):
.
. ,
) S.
S . S V
:
, test,
:
test set.
.
SVD
training set. , , ( test_new) :
, training set
-
. test_new U
test 1, 2,
. . 9 test 1 9.
Table 9: Reduced representation of the test document

test = [-0.14, 0.53, -0.33, 0.57, -0.32, -0.38, 0.02, -0.13, 0.03]
4 . , test . .
.
5. , , 5
, .
, 2, 4 9
.
,
, 4.
Not all singular values need to be kept. The choice used here keeps the smallest number of singular values that account for at least 80% of the total sum of the values in S. In this example the sum of the singular values is 51.85, and values are kept until their running sum exceeds 0.8 of that total:
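This selection rule can be sketched as follows (the singular values shown are illustrative placeholders, not those of the experiment):

```python
# Keep the smallest k whose singular values account for at least
# `threshold` of the total sum of the values in S.
def choose_k(singular_values, threshold=0.8):
    total = sum(singular_values)
    running = 0.0
    for k, s in enumerate(singular_values, start=1):
        running += s
        if running / total >= threshold:
            return k
    return len(singular_values)

sigma = [9.1, 6.4, 4.3, 2.2, 1.1, 0.6, 0.3]   # hypothetical, sorted descending
k = choose_k(sigma)   # the first three values already cover > 80%
```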
The test document is then projected into the reduced space using the kept columns of U, producing test_new. The cosine similarity of the projected test document (test_new) with each training document is then computed (Table 10).
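A sketch of this projection, assuming the standard LSI fold-in formula d_new = d^T · U_k · S_k^(-1) (the toy dimensions below are illustrative):

```python
# Fold a new (test) document into the reduced space.
def fold_in(doc, U_k, s_k):
    """doc: term-frequency vector (length m); U_k: m×k matrix; s_k: k singular values."""
    m, k = len(doc), len(s_k)
    return [sum(doc[i] * U_k[i][j] for i in range(m)) / s_k[j]
            for j in range(k)]

# Hypothetical 4-term vocabulary reduced to k = 2 dimensions.
U_k = [[0.5, 0.5],
       [0.5, -0.5],
       [0.5, 0.5],
       [0.5, -0.5]]
s_k = [2.0, 1.0]
d_hat = fold_in([1, 0, 1, 0], U_k, s_k)   # -> [0.5, 1.0]
```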
. 1 test. ,
, 1, 2 3,
.
SVD 5
.
.
Table 10: Cosine similarity of the test document with the training documents

test = [0.97, 0.87, 0.69, 0.37, -0.8, 0.26, -0.67, 0.2, -0.16]
4 Reuters
SVD,
, Reuters-21578
.
.
4.1
.
.
, .
The generator draws words from three word lists stored in the files easy.txt, medium.txt and hard.txt: easy.txt contains common, easy words, while medium.txt and hard.txt contain progressively less common and more difficult ones.
,
.
. , . , .
, 4.
. , 100, 300
, 100 , .
4:
5 (
, ). 5.
5:
,
. .
, 10.
(
6).
6:
Each word of a generated document is drawn from a foreign word list with probability 1/x, where x is a parameter of the generator. For example, with x = 5 the probability is 1/5 = 0.2, so on average 20% of the words of a document come from the other difficulty classes.
, easy.txt,
medium.txt hard.txt. .
, easy.txt
hard.txt, easy.txt medium.txt.
, ,
50% 50% . ,
,
50%
50%. 4
25%. , ok
Enter ( 7). 15
. 1 15,
( 8).
7: ok
8: 15
(1-5) ,
(6-10) (11-15). , -
E ( easy),
M ( medium) H (
hard). , Flesh
, . , 1/10
(10%) .
10 .
1/15 (6.67%) 15
. , 1/20 (5%), 20
. , , .
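The mixing rule described above can be sketched as follows (the word lists are placeholders, not the contents of the actual files):

```python
import random

# With parameter x, each word is drawn from a "foreign" difficulty class
# with probability 1/x (x = 5 gives 20% foreign words on average).
def generate(own_words, other_words, length, x, seed=0):
    rng = random.Random(seed)
    doc = []
    for _ in range(length):
        pool = other_words if rng.random() < 1.0 / x else own_words
        doc.append(rng.choice(pool))
    return doc

easy = ["sun", "star", "sky"]          # stand-ins for words from easy.txt
hard = ["heliosphere", "photosphere"]  # stand-ins for words from hard.txt
doc = generate(easy, hard, length=100, x=5)
share = sum(w in hard for w in doc) / 100   # close to 0.2 on average
```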
, , easy.txt, medium.txt hard.txt . easy.txt easy1, easy2 easy3,
medium.txt medium1, medium2 medium3, hard.txt hard1, hard2 hard3. ,
. , , easy1,
easy2 easy3 . (9, 10
11) ,
.
9 .
E . ,
( 25%). , .
9:
10 . M
. .
,
.
10:
11. . , ,
.
11:
4.2 Reuters
. .
Reuters-21578.
4.2.1
The Reuters-21578 collection contains documents that appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories by personnel from Reuters Ltd. (Sam Dobbins, Mike Topliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen, Monica Cellio, Phil Hayes, Laura Knecht, Irene Nirenburg).
(tags) SGML
SGML , (
) .
. -
, ,
.
(ID), ,
1000 ID (
).
4.2.2
The Reuters-21578 collection is distributed as 22 files. The first 21 (reut2-000.sgm through reut2-020.sgm) contain 1000 documents each, and the last (reut2-021.sgm) contains the remaining 578 documents. The files are in SGML format.
Each document is marked up as follows:

reuters: the top-level tag enclosing each document, with the attributes:
o topics: takes the value yes, no or bypass.
o lewissplit: takes the value training, test or not-used. Documents marked training were in the training set used by David D. Lewis in the papers Representation and learning in information retrieval; An evaluation of phrasal and clustered representations on a text categorization task; Feature selection and feature extraction for text categorization; and A comparison of two learning algorithms for text categorization. Documents marked test were in the corresponding test set, and not-used documents were in neither.
o cgisplit: takes the value training-set or published-testset, indicating whether the document was in the training set or the test set for the experiments of Philip J. Hayes in A shell for content-based text categorization and A system for content-based indexing of a database of news stories.
o oldid: the identifier (ID) of the document in the older Reuters-22173 collection.
o newid: the identifier (ID) of the document in Reuters-21578.
date: the date and time the document appeared on the newswire.
topics: the list of topics categories of the document (if any).
companies: the list of companies categories.
unknown: miscellaneous material not otherwise classified.
text: the text of the document, with the sub-tags:
o author: the author of the story.
o dateline: the place and date of origin of the story.
o title: the title of the story.
o body: the main text of the story.
4.2.3
Reuters-21578 documents are categorized along five category sets: exchanges, orgs, people, places and topics. Table 11 shows the number of categories in each set.
Table 11: Number of categories in each category set

Category set   Categories
Exchanges      39
Orgs           56
People         267
Places         175
Topics         135
The topics set contains economic subject categories; examples are coconut, gold, inventories and money-supply.
Reuters.
The exchanges, orgs, people and places sets contain named entities, e.g. nasdaq (exchanges), gatt (orgs), perez-de-cuellar (people) and australia (places).
,
(
topics). , , . . ,
.
5
SVD
Flesh, SVD-Aggregation .
,
, Reuters.
The following classification methods are compared:

Cosine: plain cosine similarity on the term-frequency vectors
Flesh: classification by the Flesh Reading Ease score
SVD-Cos: cosine similarity after the SVD dimensionality reduction of Chapter 3
Agg-SVD: the SVD-Aggregation method
5.1
For each category, the classification outcomes are counted in a contingency table (the mapping below follows the usual convention of Lewis, 1991):

a: documents correctly assigned to the category (true positives)
b: documents incorrectly assigned to the category (false positives)
c: documents incorrectly rejected from the category (false negatives)
d: documents correctly rejected from the category (true negatives)
From these counts, Accuracy and Error are computed, as well as the F-measure, which combines Recall and Precision:

Recall = a / (a + c)
Precision = a / (a + b)
Fallout = b / (b + d)
Accuracy = (a + d) / (a + b + c + d)
Error = (b + c) / (a + b + c + d)
F-measure = 2 · Precision · Recall / (Precision + Recall)
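With the contingency counts, the measures can be computed as follows (the counts in the example are hypothetical):

```python
# Evaluation measures from the contingency counts a, b, c, d
# (true positives, false positives, false negatives, true negatives).
def measures(a, b, c, d):
    recall = a / (a + c)
    precision = a / (a + b)
    fallout = b / (b + d)
    accuracy = (a + d) / (a + b + c + d)
    error = (b + c) / (a + b + c + d)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, fallout, accuracy, error, f_measure

# Hypothetical counts: 8 TP, 2 FP, 2 FN, 8 TN.
r, p, fo, acc, err, f1 = measures(a=8, b=2, c=2, d=8)
# -> recall 0.8, precision 0.8, fallout 0.2, accuracy 0.8, error 0.2, F 0.8
```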
5.2 Results with 0% word mixing
. .
4 .
100% F, .
Flesh
. , , ( , ).
12,
heat-map Aggregated
SVD ( )
Agg-SVD 30%.
.
100%. o
heat-map -
data set .
5.3 Results with 25% word mixing
25%.
. 12 .
100%,
SVD (SVD-Cos) 70%, 50% 30%, Aggregated
SVD 70% 30%. .
Table 12:

Method         Recall  Precision  Fallout  Accuracy  Error  F-measure
Cosine           100%       100%       0%      100%     0%       100%
Flesh             42%       100%       0%       80%    19%        60%
SVD-Cos 100%      29%        80%       3%       73%    26%        42%
Agg-SVD 100%      79%        42%      53%       57%    42%        55%
SVD-Cos 70%      100%       100%       0%      100%     0%       100%
Agg-SVD 70%      100%        60%      32%       79%    21%        76%
SVD-Cos 50%      100%        93%       3%       98%     2%        97%
Agg-SVD 50%       86%        92%       3%       93%     7%        89%
SVD-Cos 30%      100%       100%       0%      100%     0%       100%
Agg-SVD 30%      100%       100%       0%      100%     0%       100%
Table 13:

Method         Recall  Precision  Fallout  Accuracy  Error  F-measure
Cosine           100%       100%       0%      100%     0%       100%
Flesh             64%        50%      32%       66%    33%        56%
SVD-Cos 100%      93%        65%      25%       80%    19%        76%
Agg-SVD 100%      36%        83%      35%       76%    23%        50%
SVD-Cos 70%      100%       100%       0%      100%     0%       100%
Agg-SVD 70%       57%        89%       3%       83%    16%        70%
SVD-Cos 50%       93%       100%       0%       98%     2%        96%
Agg-SVD 50%       93%        87%       7%       93%     7%        90%
SVD-Cos 30%      100%       100%       0%      100%     0%       100%
Agg-SVD 30%      100%       100%       0%      100%     0%       100%
, 16 17,
.
. , Agg-SVD 30%, SVD-Cos 70%
SVD-Cos 30% .
Table 14:

Method         Recall  Precision  Fallout  Accuracy  Error  F-measure
Cosine           100%       100%       0%      100%     0%       100%
Flesh             92%        72%      17%       85%    14%        81%
SVD-Cos 100%      93%        76%      14%       88%    11%        84%
Agg-SVD 100%      57%        80%       7%       80%    19%        67%
SVD-Cos 70%      100%       100%       0%      100%     0%       100%
Agg-SVD 70%       71%       100%       0%       90%     9%        83%
SVD-Cos 50%      100%       100%       0%      100%     0%       100%
Agg-SVD 50%      100%       100%       0%      100%     0%       100%
SVD-Cos 30%      100%       100%       0%      100%     0%       100%
Agg-SVD 30%      100%       100%       0%      100%     0%       100%
13 F
. SVDCos . , , Agg-SVD,
.
Figure 13: F-measure
15, 14
SVD-Cos, SVD-Cos 50% .
SVD-Cos 30% . .
15 .
5.4 Results with 50% word mixing
50% . 15 . SVD-Cos 30%. Agg-SVD 30% Cosine.
SVD , .
Flesh .
.
.
, , , .
Flesh
() .
Table 15:

Method         Recall  Precision  Fallout  Accuracy  Error  F-measure
Cosine            92%        86%       7%       92%     7%        89%
Flesh              5%         3%      81%       15%    85%         2%
SVD-Cos 100%      71%        42%      50%       57%    42%        53%
Agg-SVD 100%      36%        33%      35%       55%    45%        34%
SVD-Cos 70%       99%        74%      17%       88%    11%        85%
Agg-SVD 70%       93%        52%      42%       69%    30%        67%
SVD-Cos 50%       99%        70%      21%       86%    14%        82%
Agg-SVD 50%       93%        59%      32%       76%    23%        72%
SVD-Cos 30%       99%        99%       0%       99%     0%        99%
Agg-SVD 30%       99%        88%       7%       95%     4%        93%
Table 16:

Method         Recall  Precision  Fallout  Accuracy  Error  F-measure
Cosine            78%        78%      10%       85%    14%        78%
Flesh             92%        46%      53%       61%    38%        61%
SVD-Cos 100%      43%        55%      17%       69%    30%        48%
Agg-SVD 100%      29%        33%      28%       57%    42%        30%
SVD-Cos 70%       86%        99%       0%       95%     4%        92%
Agg-SVD 70%       38%        71%      71%       74%    26%        48%
SVD-Cos 50%       71%        90%       3%       88%    11%        80%
Agg-SVD 50%       43%        55%      17%       69%    30%        48%
SVD-Cos 30%       99%        78%      14%       90%     9%        88%
Agg-SVD 30%       86%        75%      14%       86%    14%        80%
16 .
. SVD-Cos, Agg-SVD,
Cosine Flesh. SVD-Cos. 70%
30% . Agg-SVD 30% 50%
( F) 70% 100%.
. Flesh
. Flesh, F-measure .
, ,
.
Recall,
, Precision, . ,
. , , -
.
, . ,
, ,
.
Table 17:

Method         Recall  Precision  Fallout  Accuracy  Error  F-measure
Cosine            85%        92%      35%       92%     7%        88%
Flesh             85%        85%       7%       90%    10%        85%
SVD-Cos 100%      36%        71%      71%       73%    26%        48%
Agg-SVD 100%      29%        27%      39%       50%    50%        28%
SVD-Cos 70%       79%        99%       0%       93%     7%        88%
Agg-SVD 70%       64%        90%       3%       86%    14%        75%
SVD-Cos 50%       71%        90%       3%       88%    11%        80%
Agg-SVD 50%       57%        89%       3%       83%    16%        70%
SVD-Cos 30%       71%        99%       0%       90%     9%        83%
Agg-SVD 30%       57%        80%       7%       80%    19%        67%
17 .
.
Flesh . Recall ,
Precision . , ,
70% , 30% .
17 F
.
SVD-Cos. Agg-SVD, SVD-Cos
.
SVD-Cos, Flesh
.
Figure 17: F-measure
18 F SVD-Cos.
30% , 70%,
30% . -
, 30% 70% .
19 F Agg-SVD . 100% . 70% 50%
. 30%
.
.
Figure 20: Recall-Precision
20, 21 22 Recall-Precision .
.
SVD-Cos Agg-SVD . Flesh
. ,
. Flesh Agg-SVD, .
Figure 21: Recall-Precision
Figure 22: Recall-Precision
23 Recall Precision .
Figure 23: Recall-Precision
, heat-map
Agg-SVD ( 24). heat-map , .
5.5 Reuters
. , Reuters 21578 . , , . Flesh,
.
, ,
. , , coffee, gold ship.
Topic
. coffee
, gold
ship .
18
coffee. SVD-Cos 50%,
30% . Agg-SVD
, 30% . ,
.
Table 18: Results for the category coffee

Method         Recall  Precision  Fallout  Accuracy  Error  F-measure
Cosine            95%        60%      48%       72%    27%        75%
SVD-Cos 100%      58%        94%       2%       80%    19%        71%
Agg-SVD 100%      85%        46%      74%       50%    49%        59%
SVD-Cos 70%       88%        92%       5%       92%     8%        90%
Agg-SVD 70%       69%        60%      34%       67%    32%        64%
SVD-Cos 50%       92%        98%       1%       97%     3%        96%
Agg-SVD 50%       77%        71%      22%       77%    22%        74%
SVD-Cos 30%       92%        96%       2%       95%     5%        94%
Agg-SVD 30%       69%        90%       5%       83%    16%        78%
Table 19: Results for the category gold

Method         Recall  Precision  Fallout  Accuracy  Error  F-measure
Cosine            64%        84%       4%       86%    13%        73%
SVD-Cos 100%      97%        43%      52%       62%    37%        60%
Agg-SVD 100%      47%        62%      11%       77%    22%        53%
SVD-Cos 70%       88%        83%       6%       92%     8%        86%
Agg-SVD 70%       47%        80%      45%       82%    18%        59%
SVD-Cos 50%       94%        89%      45%       95%     5%        91%
Agg-SVD 50%       65%        55%      20%       75%    24%        59%
SVD-Cos 30%       88%        79%       9%       90%    10%        83%
Agg-SVD 30%       88%        68%      15%       85%    14%        77%
Table 20: Results for the category ship

Method         Recall  Precision  Fallout  Accuracy  Error  F-measure
Cosine            22%        80%       2%       75%    24%        34%
SVD-Cos 100%      28%        97%       1%       79%    21%        43%
Agg-SVD 100%      25%        81%       2%       70%    29%        36%
SVD-Cos 70%       89%        89%      46%       93%     7%        89%
Agg-SVD 70%       78%        67%      16%       82%    18%        72%
SVD-Cos 50%       94%        89%       4%       95%     5%        92%
Agg-SVD 50%       44%        62%      11%       75%    24%        52%
SVD-Cos 30%       78%        82%       7%       89%    11%        80%
Agg-SVD 30%       67%        63%      16%       77%    21%        65%
Table 20 shows the results for the category ship. In Table 19, Cosine gives the weakest results.
Figure 25: F-measure
25 F-measure
.
28, 29 30 Recall-Precision,
. SVD-Cos
. ,
Agg-SVD Cosine,
(Precision) Agg-SVD
Recall. SVD
, . 30%
Agg-SVD 50% SVD-Cos.
,
31. Recall-Precision
, SVD-Cos
.
Agg-SVD.
6
. ,
Singular Value Decomposition (SVD) . ,
Flesh Reading Ease.
,
, . ,
Flesh Reading Ease,
Aggregated SVD. , Flesh Reading Ease . , Aggregated
SVD, SVD Aggregation
SVD, ,
.
, ,
SVD . , Reuters-21578.
: .
,
: , . , ,
0%, 25% 50% . ,
Reuters, ,
.
SVD , .
SVD
. ,
. ,
,
.
Baeza-Yates, R., Ribeiro-Neto, B. (1999) Modern Information Retrieval, Addison Wesley Longman Inc.
Baker, K. (2005) Singular Value Decomposition Tutorial, http://www.cs.wits.ac.za/
~michael/SVDTut.pdf
Balabanovic, M., Shoham, Y. (1997) Fab: Content-based, collaborative recommendation, ACM Communications, volume 40, number 3, pages 66-72
Bharat, K., Henzinger, M. R. (1998) Improved Algorithms for Topic Distillation in a Hyperlinked Environment, Proc. ACM Conf. Research and Development in Information Retrieval
Chakrabarti, S., Dom, B., Indyk, P. (1998) Enhanced Hypertext categorization Using
Hyperlinks, Proc. ACM SIGMOD international Conference on Management Data, volume 27, number 2, pages 307-318
Edelstein, H. (1999) Introduction to Data Mining and Knowledge Discovery. 3rd ed.,
Two Crows Corporation
Furnas, G., Deerwester, S. (1988) Information retrieval using a singular value decomposition model of latent semantic structure, SIGIR, pages 465-480
Guan, H., Zhou, J., Guo, M. (2009) A Class-Feature-Centroid Classifier for Text Categorization, WWW 2009, Madrid, pages 201-210
Han, J., Kamber, M. (2001) Data Mining: Concepts and Techniques. CA: Morgan
Kaufmann, San Francisco
Hans-Henning, G., Spiliopoulou, M., Nanopoulos, A. (2010) Eigenvector-Based Clustering Using Aggregated Similarity Matrices. ACM SAC 10, pages 1083-1087
Herlocker, J., Konstan, J., Terveen, L., Riedl, J. (2004) Evaluating Collaborative Filtering Recommender System, ACM Trans. on Information Systems, volume 22, number 1,
pages 5-53
Joachims, T. (1998) Text categorization with support vector machines: Learning with
many relevant features, Springer Verlag, pages 137-142
Kim, H., Howland, P., Park, H. (2006) Dimension Reduction in Text Classification with Support Vector Machines, Journal of Machine Learning Research, volume 6, number 1, pages 37-53
Lewis, D. D. (1991) Evaluating text categorization, HLT 91 Proceedings of the workshop on Speech and Natural Language, 312-318
Melville, P., Mooney, R. J., Nagarajan, R. (2002) Content-Boosted Collaborative filtering for Improved Recommendations, AAAI, pages 187-192
Montanes, E., Diaz, I., Ranilla, J., Combarro, E. F., Fernandez J. (2005) Scoring and
selecting terms for text categorization, IEEE Intelligent Systems, volume 20, number 3,
pages 40-47
Papadimitriou, C. H., Raghavan, P., Tamaki, H., Vempala, S. (1998) Latent Semantic Indexing: a Probabilistic Analysis, Proc. ACM Symposium on Principles of Database Systems, volume 61, number 2, pages 217-235
Pei, Z. L., Shi, X. H., Marchese, M., Liang, Y. C., (2007) An enhanced text categorization method based on improved text frequency approach and mutual information algorithm, Progress in Natural Science, volume 17, number 12, pages 1494-1500
Robertson, S. E., Sparck Jones, K. (1976) Relevance weighting of search terms, Journal of the American Society for Information Science, volume 27, number 3, pages 129-146
Sarwar, B., Karypis, G., Konstan, J., Riedl, J. (2000) Application of dimensionality reduction in recommender system - A case study, ACM WebKDD Workshop, volume 1625, number 1, pages 285-295
Salton, G., Buckley, C., (1988) Term weighting approaches in automatic retrieval, Information Processing & Management, volume 24, number 5, pages 513-523
Soucy, P., Mineau, G. W. (2003) Feature selection strategies for text categorization,
Advances in Artificial Intelligence, Proceedings, 2671, pages 505-509
Symeonidis, P. (2007) Content-based Dimensionality Reduction for Recommender Systems, GfKl 2007, pages 619-626
Yang, Y., Pedersen, J. (1997) A comparative study on feature selection in text categorization, In International Conference on Machine Learning (ICML), pages 412-420
1:
The Flesh Reading Ease implementation used is the open-source Flesh tool from http://flesh.sourceforge.net (folder Flesh on the accompanying cd). The programs were built and run with Microsoft Visual Studio 2008 and Matlab R2010a on Windows 7.
1.1 Generator
The generator is in the folder Generator on the accompanying cd. Open MS Visual Studio (version 2008 was used) and select File→Open→Project/Solution, navigate to the Generator folder and select the file Generator.sln. After pressing Open, the Generator project loads; it is executed with F5.
,
.
. 90 , 30 ,
150 50% ( 32).
training set. /Generator/Generator/
( easy.txt, medium.txt hard.txt
). 90 (
). (..
) (.. training_set)
90 . training
set .
, training set
test set. O . 1.3 .
1.2 Reuters
, Reuters.
, ,
. , (.. ) training_set test_set.
, Reuters 22
(reut2-000.sgm reut2-021.sgm), 21 578 . Reuters21578. (..
Notepad++) 22
sgml tags. TOPIC .
tag .
.
( .. ,
coffee, gold ship).
earn .
22 Find earn. :
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="5757" NEWID="214">
<DATE>26-FEB-1987 19:16:39.70</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
F
f0395reute
u f BC-AVERY-<AVY>-SETS-TWO 02-26 0086</UNKNOWN>
<TEXT>
<TITLE>AVERY <AVY> SETS TWO FOR ONE STOCK SPLIT</TITLE>
<DATELINE> ... </DATELINE><BODY>
board authorizerd
a two for one stock split, an increased in the quarterly
dividend and plans to offer four mln shares of common stock.
The company said the stock split is effective March 16 with
a distribution of one additional share to each shareholder of
record March 9.
It said the quarterly cash dividend of 10.5 cts per share
on the split shares, a 10.5 pct increase from the 19 cts per
share before the split.
Avery said it will register with the Securities and
Exchange Commission shrortly to offer four mln additional
common shares. It will use the proceeds to repay debt, finance
recent acquisitions and for other corporate purposes.
Reuter
</BODY></TEXT>
</REUTERS>
The TOPICS tag shows the category of the document (here earn), and the BODY tag encloses the main text. For each selected document the text inside the BODY tag is copied into a separate txt file, which is placed in the training_set folder.
Reuters , earn (
, 22 Reuters).
, , txt, training_set.
. txt .
, txt
E
. , E
Reuters.
.
,
TOPIC .
. E
M H. training_set
txt .
test_set ,
TOPIC training set. , , earn
E, M H . ,
, training set , training_set test_set.
training set set ,
sgml tags.
cd Reuters training set test set
coffee, gold ship,
.
1.3
, Diplomatiki
cd .
Open MS Visual Studio and select File→Open→Project/Solution, navigate to the Diplomatiki folder and select the file Diplomatiki.sln; after pressing Open, the Diplomatiki project loads. Then open Diplomatiki.cpp from the Solution Explorer of MS Visual Studio. In lines 49 and 50 of the source, set the paths of the training set and test set folders (Figure 34):
Figure 34:
F5 .
The rest of the computation runs in Matlab (version R2010a was used). Start Matlab and change the working directory to the folder of the scripts; then run the main script (Figure 35):
Figure 35:
In line 284, set the sizes of the data: here the training set has 90 documents and the test set 30.
The script then executes the SVD-Cos and Agg-SVD methods. The fraction of the dimensions to keep is given as a number between 0 and 1: to keep 30% of the dimensions, for example, type 0.3 and press Enter. The script keeps the corresponding number of columns of the matrix U of the SVD (Figure 36); here 30% corresponds to 8 retained dimensions.
Figure 36:
, .
Easy , Medium
Hard . Reuters
Easy, Medium Hard E, M H txt, .
Figure 37:
Figure 38: SVD