Korean Newspaper Mining
1. Background
2. Related Work & Literature Review
3. Technique Method
4. Results
5. Conclusions
1. Background
Research Questions
RQ:
RQ1.
RQ2.
RQ3.
1. Background
The Goal of this study
, ,
6
, , , , 5
,
,
Clustering ,
Classification
( - )
Classification
2010 10 4 23 10
1
1
:
( , 2008)
( , 2006)
( 2002)
( ), (
), ( )
LED
, ,
( 2011)
7 2011)
3. Technique Method
, PNNC
MALLET Package
Clustering
Classification
3. Technique Method
: 3,026
: URL
URL HTML
, ,
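The collection step on this slide (a list of URLs, then the HTML behind each URL) could be sketched roughly as below. This is a minimal illustration with Python's standard library, not the crawler used in the study. The names `fetch_html`, `ArticleTextExtractor`, and `html_to_text`, the default encoding, and the sample markup are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url, encoding="utf-8", timeout=10):
    """Download one page and decode it (the encoding is an assumption;
    older Korean news sites may need "euc-kr" instead)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding, errors="replace")


class ArticleTextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page, used only to show the behaviour.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Body &amp; text</p>"
    "<script>var x = 1;</script></body></html>"
)
article_text = html_to_text(sample)
```

In a real run, each URL from the collected list would go through `fetch_html` first and its result through `html_to_text`; the sample string stands in for a downloaded page so the sketch needs no network access.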
3. Data Preprocessing
1.
, ,
10
: 4
2.
2008.12.29-2009.01.29
2012.04.25-2012.05.25
:
3.
2012.04.25-2012.05.25
4.
2012.03.15-2012.04.15
:
5.
2012.04.25-2012.05.25
6.
2012.04.25-2012.05.25
HTML,
Lucene Korean (
)
Java ,
KLT
(,,,,,.,(,),,|,,,,,,,,,)
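The punctuation list above suggests a filtering pass over the extracted tokens. A minimal sketch of such a pass follows. It does not reproduce the Lucene Korean analyzer or KLT: whitespace splitting stands in for morphological analysis. Only `,`, `.`, `(`, `)`, and `|` are visible in the slide; the other characters in `PUNCTUATION`, the `clean_tokens` name, and the sample sentence are assumptions.

```python
# Characters to strip from tokens. ",.()|" appear on the slide;
# the remaining characters are assumed for illustration.
PUNCTUATION = ",.()|" + "\"'!?:;[]"
_STRIP_TABLE = str.maketrans("", "", PUNCTUATION)


def clean_tokens(text, stopwords=frozenset()):
    """Split on whitespace, delete punctuation characters, and drop
    empty tokens and stopwords."""
    tokens = []
    for raw in text.split():
        token = raw.translate(_STRIP_TABLE)
        if token and token not in stopwords:
            tokens.append(token)
    return tokens


# Hypothetical input, not taken from the corpus.
tokens = clean_tokens("government, (Seoul) announced | today.")
filtered = clean_tokens("government, (Seoul) announced | today.", {"today"})
```

Deleting characters with `str.translate` keeps the pass independent of any regular-expression escaping rules, which makes it easy to extend the character list.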
Clustering
/ PNNC ( 2006)
3. Technique Method
Classification:
MALLET Package
Naive Bayes
(Precision), (Recall), F-value
70:30
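The classification setup above (Naive Bayes, a 70:30 split) can be illustrated with a small self-contained sketch. The study used the MALLET package; the code below is not MALLET and only shows the general technique, a multinomial Naive Bayes classifier with add-one smoothing. The names `split_70_30` and `NaiveBayes`, the smoothing constant, and the toy documents are assumptions.

```python
import math
import random
from collections import Counter, defaultdict


def split_70_30(items, seed=0):
    """Shuffle a copy of items and return (train, test) in a 70:30 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


class NaiveBayes:
    """Multinomial Naive Bayes over token lists with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs):
        # docs is a list of (tokens, label) pairs.
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in docs:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token in self.vocab:
                    count = self.word_counts[label][token]
                    score += math.log(
                        (count + self.alpha)
                        / (total_words + self.alpha * vocab_size)
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for the example.
toy_docs = [
    (["tax", "budget"], "economy"),
    (["budget", "market"], "economy"),
    (["match", "goal"], "sports"),
    (["goal", "team"], "sports"),
]
model = NaiveBayes().fit(toy_docs)
predicted = model.predict(["budget", "tax"])
train, test = split_70_30(range(10))
```

Tokens that never occurred in training are skipped at prediction time, which is one common choice; another is to map them to a shared unknown-word slot.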
( )    750    1400    350    150    2000    150
( )    150    150     100    50     150     50
4. Results
10
( )
:
:
0.47
Precision   Recall      F-value
0.479339    0.386667    0.428044
0.460674    0.546667    0.5
0.483444    0.486667    0.48505
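The reported values follow the usual relation F = 2PR / (P + R) between precision P and recall R. As a worked check, the first three values reported above (0.479339, 0.386667, 0.428044) are reproduced by 58 correct predictions out of 121 predicted and 150 actual items. These counts are an inference from the numbers, not something stated in the slides, and `precision_recall_f` is a hypothetical helper name.

```python
def precision_recall_f(true_positives, predicted, actual):
    """Return (precision, recall, F-value) from raw counts.

    predicted: number of items the classifier assigned to the class
    actual:    number of items that really belong to the class
    """
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Counts inferred from the first reported row (an assumption).
p, r, f = precision_recall_f(58, 121, 150)
```

The same function can be applied per class to any confusion counts produced by a classifier run.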
4. Results
10
( )
: 6
0.7
, F-value 0.5 ->
Precision   Recall      F-value
0.401515    0.706667    0.512077
0.346154    0.18        0.236842
0.425926    0.306667    0.356589
4. Results
( )
Clustering
Clustering
4. Results
10
( )
:
,
0.59
Precision   Recall      F-value
0.384615    0.5         0.434783
0.511628    0.44        0.473118
0.595238    0.5         0.543478
4. Results
10
( )
:
, ,
, :
Precision   Recall      F-value
0.430233    0.37        0.397849
0.5125      0.41        0.455556
0.395522    0.53        0.452991
4. Results
: ,
10
( )
:
:
: 0.6
0.84, F-value 0.81,
0.8
Precision   Recall      F-value
0.847826    0.78        0.8125
0.655738    0.8         0.720721
0.627907    0.54        0.580645
4. Results
( )
4. Results
: , ,
, ,
10
(4 )
Precision   Recall      F-value
0.435897    0.34        0.382022
0.349398    0.58        0.43609
0.642857    0.36        0.461538
4. Results
(4 )
4. Results
(4 )
4 ,
150
50
3 , F-value 0.6
, 0.7
Precision   Recall      F-value
0.672414    0.78        0.722222
0.62        0.738095    0.673913
0.672414    0.78        0.722222
4. Results
(4 )
4
2012.04.08-2012.07.08
: 4
: 97 / : 85 / : 47
F-value 72%
67%
70%
4. Results
(4 )
4
4 70%
7% , 3%
5. Conclusions
4
,
Clustering
Classification , 70%
5. Conclusions
3,000
clustering
Classification
Classification 0.7 :
5. Conclusions
: Topic Modeling
: , , ,
2010.11.01-2012.10.31
(3,928)    (8,110)    (4,182)
(1,450)    (3,244)    (1,794)
(2,008)    (4,359)    (2,351)
1,880    2,048    2,213    1,969
685      765      937      857
960      1,048    1,304    1,047
Dirichlet ,
. ,
,
topic
Topic Modeling
10 topic .
7 topic
Topic 1.
Topic 2.
Topic 3.
Topic 4.
Topic 5.
Topic 6.
Topic 7.
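The topic modeling slides mention a Dirichlet-based model and runs with 10 and 7 topics. Assuming this refers to latent Dirichlet allocation (the surviving text does not confirm the exact model or tool), the idea can be sketched with a small collapsed Gibbs sampler. The function names, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions; the toy run uses 2 topics only to stay small.

```python
import random


def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (vocab, doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            z = rng.randrange(num_topics)
            topics.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=3):
    """List the n highest-count words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Toy corpus, invented for the example.
toy_corpus = [
    ["election", "party", "vote", "party"],
    ["vote", "election", "candidate"],
    ["market", "stock", "price", "stock"],
    ["price", "market", "trade"],
]
vocab, doc_topic, topic_word = lda_gibbs(toy_corpus, num_topics=2)
topics = top_words(vocab, topic_word)
```

Calling `top_words` on the result gives per-topic keyword lists of the kind the Topic 1 to Topic 7 slides appear to have shown. The sampled topics depend on the seed, so no particular word grouping is claimed here.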
* References
, , :
, , 41 (2008), 232~267.
. . Accessed 2012.04.12,
<http://www.mediatoday.co.kr/news/articleView.html?idxno=91565>.
, , .
2003 , 2003 11 , . 574~580.
, , : , ,
, , 34 (2006), 132~162.
, , , 40
3 (2006a), 191~214.
, , , 23 4
(2006b), 215~231.
, , , , 2005.
, , , 46 4 (2002), 314~348.
, , , , , 17 4
(2011), 227~240.
, , , 2010.
Caldas, Carlos H., and Lucio Soibelman, Automating hierarchical document classification for construction management information systems, Automation in Construction, Vol.12 (2003), 395~406.
Pollak, Senja, Roel Coesemans, Walter Daelemans, and Nada Lavrač, Detecting Contrast Patterns in Newspaper Articles by Combining Discourse Analysis and Text Mining, Pragmatics, Vol.21, No.4 (2011), 647~683.
Balahur, A., and R. Steinberger, Rethinking sentiment analysis in the news: From theory to practice and back, In Proceedings of the 1st Workshop on Opinion Mining and Sentiment Analysis, Satellite to CAEPIA 2009, (2009).
1.
1)
0.13967    13472
0.08543    8610
( / )
0.40472    26825
0.25105    26598
( )
0.11217    13680
0.10259    15239
0.05549    10756
0.21487    16733
0.24956    17256
0.25473    20971
0.17779    17851
0.21742    18208
2)
0.16248    11974
0.16728    13883
0.09057    8870
0.12418    12505
0.43681    35593
0.25515    27772
0.05842    6672
0.06781    6904