
THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS

TECHNICAL REPORT OF IEICE


Estimating Confidence in Machine Learning for Splog Detection


Taichi KATAYAMA, Yuuki SATO, Takehito UTSURO, Takayuki YOSHINAKA, Yasuhide KAWADA, and Tomohiro FUKUHARA
College of Engineering Systems, Third Cluster of Colleges, University of Tsukuba, Tsukuba, 305-8573, Japan
Grad. Sch. Systems and Information Engineering, University of Tsukuba, Tsukuba, 305-8573, Japan
Graduate School of Engineering, Tokyo Denki University, Tokyo, 101-8457, Japan
Navix Co., Ltd., 8-3-6 Nishi-Gotanda, Shinagawa-ku, Tokyo 141-0031, Japan
Research into Artifacts, Center for Engineering, University of Tokyo, Kashiwa, Chiba 277-8568, Japan
Abstract This paper studies the problem of identifying spam blogs (splogs) with SVMs. An SVM-based classifier uses a separating hyperplane to classify blogs as splogs or authentic blogs. In our study, we further use the distance from the separating hyperplane to an instance as a confidence measure. In our approach to semi-automatic collection of splogs and re-training of the SVM-based classifier, we use this confidence measure to identify blog sites that cannot be reliably judged to be either splogs or authentic blogs. We also examine whether we can automatically identify professional spammers who automatically create large numbers of similar splogs.
Key words spam blog, machine learning, confidence measure, SVM
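
The confidence measure described in the abstract can be illustrated with a minimal sketch. It assumes a linear decision function w·x + b; the weights, bias, feature vectors, blog names, and threshold below are hypothetical values chosen for illustration, not taken from the report, and the report's own classifier (TinySVM) may use a different kernel and feature set. Instances whose distance from the hyperplane falls below the threshold are set aside as not reliably judged.

```python
import math

def signed_distance(weights, bias, features):
    """Signed Euclidean distance from the hyperplane w.x + b = 0 to a point."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    return score / norm

def triage(weights, bias, instances, threshold):
    """Split instances into confidently labeled ones and uncertain ones.

    The absolute distance from the hyperplane serves as the confidence;
    anything closer than `threshold` is left for manual judgment.
    """
    confident, uncertain = [], []
    for name, features in instances:
        d = signed_distance(weights, bias, features)
        if abs(d) >= threshold:
            label = "splog" if d > 0 else "authentic"
            confident.append((name, label, d))
        else:
            uncertain.append((name, d))
    return confident, uncertain

# Hypothetical linear model and blog feature vectors (illustration only).
WEIGHTS = [3.0, 4.0]
BIAS = -5.0
BLOGS = [
    ("blog_a", [3.0, 4.0]),  # far on the positive side
    ("blog_b", [1.0, 0.6]),  # very close to the hyperplane
    ("blog_c", [0.0, 0.0]),  # clearly on the negative side
]

confident, uncertain = triage(WEIGHTS, BIAS, BLOGS, threshold=0.5)
for name, label, d in confident:
    print(f"{name}: {label} (distance {d:+.2f})")
for name, d in uncertain:
    print(f"{name}: not reliably judged (distance {d:+.2f})")
```

In a semi-automatic collection loop of the kind the abstract mentions, the uncertain instances would be natural candidates for manual labeling before re-training; how the threshold is chosen is left open here.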

1.


Technorati 1, BlogPulse 2, kizasi.jp 3

blogWatcher 4 [1]
Globe of Blogs

2.
SVM

Best Blogs in Asia Directory 6

2. 1

Blogwise

2. 1. 1

()

[12], [13]

[2][6]

[4] 88%

75%

[3], [7]

2. 1. 2

[5]

TREC 8 Blog06

URL

[4], [6] BlogPulse

[4], [8]

[10]

Support Vector Machines [11] (SVM)

2. 2
2. 3

SVM

SVM

URL URL

2. 3. 1

SEO

2. 3. 2 URL URL

[8], [9]

1 http://technorati.com/
2 http://www.blogpulse.com/

3. SVM

3 http://kizasi.jp/

3. 1 SVM

4 http://blogwatcher.pi.titech.ac.jp/

SVM TinySVM 10

5 http://www.globeofblogs.com/
6 http://www.misohoni.com/bba/
7 http://www.blogwise.com/
8 http://trec.nist.gov/

9 (http://chasen-legacy.sourceforge.jp/) ipadic

10 http://chasen.org/~taku/software/TinySVM/

1: 2: 3: 4:
1

1: +2: +3: +
2

[12], [13]

2. 2

4647 1695

F 3

761 934

F 0.875

10

F 0.902

SVM

4. 2

[12], [13]

4.

ID

4. 1

ID

ID=15

4 (a)

3 SVM (
+)

4 (b)
ID=2

ID=2

5 (a)
(b) ID=2

(b)

5.

90%

ID

[1] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. '04: Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pp. 320-321. ACM Press, 2004.
[2] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In AIRWeb '05: Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web, pp. 39-47, 2005.

[3] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.
[4] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proceedings of WWW 2006 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.
[5] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006.
[6] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007.
[7] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In AIRWeb '07: Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, pp. 1-8, 2007.
[8] . .
Letters, Vol. 6, No. 4, pp. 37-40, 2008.
[9] .
. Web
(WebDB Forum)2008 . , 2008.
[10] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pp. 92-99, 2006.
[11] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
[12] , , , , , ,
.
. 19
6
(DEWS2008) , 2008.
[13] , , , , , ,
.
. 22 , 2008.

Figure 4 (a), (b): (F-measure)
Figure 5 (a), (b): ID=2
