
The 25th Annual Conference of the Japanese Society for Artificial Intelligence, 2011

IDHTML

2E2-3in

Detecting Splogs by Integrating Machine Learning / Affiliate IDs / Similarity of HTML Structures

Akihito Morijiri

Taichi Katayama

Takehito UTSURO

Yasuhide Kawada

Tomohiro FUKUHARA


Graduate School of Systems and Information Engineering, University of Tsukuba

NTT
NTT Cyber Space Laboratories, NTT Corporation
4

()
Navix Co., Ltd.

Center for Service Research, National Institute of Advanced Industrial Science and Technology

Spam blogs, or splogs, are blogs hosting spam posts, created using machine-generated or hijacked content for
the sole purpose of hosting advertisements or raising the number of in-links of target sites. It has been shown
that splogs can be detected based on similarity of HTML structures and on affiliate IDs. The similarity of HTML
structures of splogs is effective in splog detection, and the identity of affiliate IDs extracted from splogs can identify
spammers much more directly than similarity of HTML structures, although it is not easy to achieve high coverage
in extracting affiliate IDs. The coverage of the intersection of the two clues, similarity of HTML structures and
affiliate IDs, is relatively low, so it is necessary to apply them in a complementary strategy. This paper studies
how to detect splogs that cannot be detected based on either similarity of HTML structures or affiliate IDs. We
apply SVMs to this task and show that splogs of this type can be detected with high precision.
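The complementary strategy described in the abstract can be sketched roughly as a cascade over the three clues. This is only an illustration under stated assumptions: the `Blog` record, the `structure_key` fingerprint (which reduces HTML-structure similarity to an exact match), the order in which the first two clues are tried, and the `svm_says_splog` callback are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Set


@dataclass(frozen=True)
class Blog:
    url: str
    # Affiliate IDs extracted from the page; often empty, since extraction coverage is limited.
    affiliate_ids: FrozenSet[str] = frozenset()
    # Hypothetical stand-in for an HTML-structure signature (assumption, not the paper's measure).
    structure_key: str = ""


def detect_splog(
    blog: Blog,
    known_affiliate_ids: Set[str],
    known_structure_keys: Set[str],
    svm_says_splog: Callable[[Blog], bool],
) -> Optional[str]:
    """Return the name of the clue that flagged the blog, or None if no clue fired."""
    # Clue 1: the blog shares an affiliate ID with known splogs.
    if blog.affiliate_ids & known_affiliate_ids:
        return "affiliate_id"
    # Clue 2: the blog's HTML structure matches that of known splogs.
    if blog.structure_key and blog.structure_key in known_structure_keys:
        return "html_structure"
    # Clue 3: fall back to a trained classifier for blogs the first two clues miss.
    if svm_says_splog(blog):
        return "svm"
    return None
```

The classifier is consulted only for blogs that neither of the first two clues covers, which is the case this paper targets.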

1.

/
[Kolari 06b, Kolari 07] BlogPulse

[ 08, Kolari 06a, Mishne 05, Lin 07]


HTML

HTML
ID

[ 10b]

HTML

3
1 1
[ 10] ID

ID
3
1 2
2
[ 10a, Katayama 11]
2

HTML ID
( 1 4
) Support Vector
Machines (SVMs) [Vapnik 98]

Technorati, BlogPulse [Glance 04]


kizasi.jp
Globe of Blogs
Best Blogs in Asia Directory

() [Gyöngyi 05, Kolari 06b, Macdonald 06, Kolari 07]

[Kolari 06b]
88%
75%
[Lin 07]

[Kolari 06b, Macdonald 06, Kolari 07]


[Macdonald 06] TREC
Blog06
: 305-8573 1-1-1, 029-853-5427


1: 10 ID
ID
H

ID

S
F

72
56





 SPaf (H, x)
x

1:

2.

1,101
953

2: ID

ID

(S F ) 10
ID
129 ID
ID 2,472

ID ID 121
ID 2,054
( ID 93.8%
83.3%)2 121
ID x H
ID x SPaf (H, x)
ID
ID
1

2 ID

ID ID ID
ID
[ 10] ASP(
)
ID 10 1
ID

KANSHIN [ 07]

RSS
Atom

S F 2
11 S 48,183
14,352 ( 30%) F 60,977
6,231 ( 10%) ID

ASP
ID

ID
ID
ID
ID
ID

[ 10]
ID
ID
[ 10]
ID
ASP10
ID 20,583

3.

SVM TinySVM
(http://chasen.org/~taku/software/TinySVM/)
[ 11]
2
2 4.
2
.
SVM
[Tong 00]

LBDs
LBDab
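As a rough illustration of how a confidence measure based on the distance from the SVM separating hyperplane (in the spirit of [Tong 00]) might be applied, the sketch below assumes that LBDs and LBDab denote lower bounds on that distance for the splog side and the authentic-blog side, respectively. That reading is an assumption, since the surrounding text did not survive; the linear model, the feature names, and both function names are hypothetical, and TinySVM itself is not called here.

```python
import math
from typing import Dict, Optional


def signed_distance(weights: Dict[str, float], bias: float, features: Dict[str, float]) -> float:
    """Signed distance of a feature vector from the hyperplane w.x + b = 0 (linear model assumed)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        raise ValueError("weight vector must be non-zero")
    score = sum(weights.get(name, 0.0) * value for name, value in features.items()) + bias
    return score / norm


def label_with_bounds(distance: float, lbd_splog: float, lbd_authentic: float) -> Optional[str]:
    """Label only when the example lies far enough from the hyperplane.

    lbd_splog and lbd_authentic play the role assumed here for LBDs and LBDab.
    """
    if distance >= lbd_splog:
        return "splog"       # confidently on the positive (splog) side
    if -distance >= lbd_authentic:
        return "authentic"   # confidently on the negative (authentic blog) side
    return None              # too close to the hyperplane; leave undecided
```

Raising either bound trades coverage for precision on that side, which is the usual reason for thresholding on the distance rather than on the sign alone.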

2 ID
ID

ID
ID
ID
ID
ID

1 Am At D Gl I Lk R St Tr
V


Figure 3: (S), panels (a) and (b)

Figure 4: (F), panels (a) and (b)

4.

ID

ID S 72
ID F 56 ID
ID
ID
500 () SVM

ID

ID
S 13 ID F
12 ID
ID
ID
500 ()
SVM
500
3

ID ( 1 2
3)
SVM
HTML ID
( 1 4)

4.1

2. S 48,183 F
60,977 10
ID
1 S 1,101 F 953

ID

ID

ID

ID


4.2

ID [ 10]

2. S 48,183
F 60,977
10
ID 10
ID HTML
(MinDF > 0.15)4 S 47,029 F
59,982
1 4

S 283 F 268
/

4.3

5.

HTML
ID

HTML
ID
SVM

LBDs LBDs

LBDs
S 3(a) F
4(a)

LBDab
LBDab

LBDab
S 3(b) F 4(b)

3 4

ID
ID
ID

ID

4.4

[ 07] , , ,
, 21
(2007)
[Glance 04] Glance, N., Hurst, M., and Tomokiyo, T.: BlogPulse:
Automated Trend Discovery for Weblogs, in Proc. Workshop on
the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2004)
[Gyöngyi 05] Gyöngyi, Z. and Garcia-Molina, H.: Web Spam Taxonomy, in Proc. 1st AIRWeb, pp. 39-47 (2005)
[ 08] , Letters,
Vol. 6, No. 4, pp. 37-40 (2008)
[ 10] , , , ID
, WebDB2010 (2010)
[ 10a] , , , , ,
HTML
, WebDB2010 (2010)
[ 10b] , , , , HTML
, 2 DEIM
(2010)
[Katayama 11] Katayama, T., Morijiri, A., Ishii, S., Utsuro, T.,
Kawada, Y., and Fukuhara, T.: Comparing Similarity of HTML
Structures and Affiliate IDs in Splog Analysis, in Xu, J., et al. eds.,
Proc. 16th DASFAA, Inter. Workshops: SNSMW, Vol. 6637 of
LNCS, pp. 378-389, Springer (2011)

[Kolari 06a] Kolari, P., Finin, T., and Joshi, A.: SVMs for the Blogosphere: Blog Identification and Splog Detection, in Proc. 2006
AAAI Spring Symp. Computational Approaches to Analyzing
Weblogs, pp. 92-99 (2006)

3 4
ID

ID

ID

3(a) 4(a)
S 35%F
60%90%
3(b)
4(b) S 17%
F 15%90%

4.2 1
4HTML
ID

HTML [ 10b]

[Kolari 06b] Kolari, P., Joshi, A., and Finin, T.: Characterizing the
Splogosphere, in Proc. 3rd Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2006)
[Kolari 07] Kolari, P., Finin, T., and Joshi, A.: Spam in Blogs and
Social Media, in Tutorial at ICWSM (2007)
[Lin 07] Lin, Y.-R., Sundaram, H., Chi, Y., Tatemura, J., and
Tseng, B. L.: Splog Detection using Self-similarity Analysis on
Blog Temporal Dynamics, in Proc. 3rd AIRWeb, pp. 1-8 (2007)
[Macdonald 06] Macdonald, C. and Ounis, I.: The TREC Blogs06
Collection: Creating and Analysing a Blog Test Collection, Technical Report TR-2006-224, University of Glasgow, Department of
Computing Science (2006)
[Mishne 05] Mishne, G., Carmel, D., and Lempel, R.: Blocking Blog
Spam with Language Model Disagreement, in Proc. 1st AIRWeb
(2005)
[ 11] , , , , ,
HTML
, 3 DEIM (2011)
[Tong 00] Tong, S. and Koller, D.: Support Vector Machine Active
Learning with Applications to Text Classification, in Proc. 17th
ICML, pp. 999-1006 (2000)
[Vapnik 98] Vapnik, V. N.: Statistical Learning Theory, Wiley-Interscience (1998)

4 MinDF [ 10b]
