IDHTML
2E2-3in
Akihito Morijiri
Taichi Katayama
Takehito UTSURO
Yasuhide Kawada
Tomohiro FUKUHARA
Graduate School of Systems and Information Engineering, University of Tsukuba
NTT
NTT Cyber Space Laboratories, NTT Corporation
Navix Co., Ltd.
Center for Service Research, National Institute of Advanced Industrial Science and Technology
Spam blogs, or splogs, are blogs that host spam posts created from machine-generated or hijacked content, for
the sole purpose of hosting advertisements or raising the number of in-links of target sites. It has been shown
that splogs can be detected based on the similarity of HTML structures and on affiliate IDs. The similarity of HTML
structures of splogs is effective in splog detection, while the identity of affiliate IDs extracted from splogs identifies
spammers much more directly, although it is not easy to achieve high coverage in extracting affiliate IDs.
The coverage of the intersection of the two clues, similarity of HTML structures and affiliate IDs, is relatively low,
so the two clues need to be applied in a complementary strategy. This paper studies
how to detect splogs that cannot be detected by either the similarity of HTML structures or affiliate IDs. We
apply SVMs to this task and show that splogs of this type can be detected with high precision.
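As an illustration of the SVM-based detection described above, the following is a minimal sketch and not the authors' setup (the paper uses TinySVM). It trains a linear SVM by subgradient descent on the hinge loss and then, as one common way to favour precision over coverage, labels a blog only when its decision value lies far enough from the separating hyperplane. The two-dimensional feature vectors, the toy data, the hyperparameters, and the threshold of 0.8 are all hypothetical and chosen only for illustration.

```python
def train_linear_svm(samples, labels, eta=0.1, lam=0.01, epochs=100):
    """Fit a linear SVM by subgradient descent on the regularised hinge loss.

    samples: list of equal-length feature lists
    labels:  +1 for splog, -1 for authentic blog
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # L2 regularisation shrinks the weights on every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:
                # Margin violated: move the hyperplane in favour of this sample.
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b


def decision_value(w, b, x):
    """Signed score, proportional to the distance from the separating hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def classify(w, b, x, threshold=0.8):
    """Label only confident cases; the threshold is an assumed value."""
    score = decision_value(w, b, x)
    if score >= threshold:
        return "splog"
    if score <= -threshold:
        return "authentic"
    return "undecided"


# Hypothetical toy data: positive coordinates stand for splog-like features.
samples = [[1.0, 0.8], [-1.0, -0.9], [0.9, 1.0],
           [-0.8, -1.0], [0.8, 0.9], [-0.9, -0.8]]
labels = [1, -1, 1, -1, 1, -1]

w, b = train_linear_svm(samples, labels)
for x in ([1.0, 1.0], [-1.0, -1.0], [0.05, -0.05]):
    print(x, classify(w, b, x))
```

Raising the threshold leaves more blogs undecided but makes the remaining "splog" decisions more reliable, which is the usual precision/coverage trade-off for this kind of classifier.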
1.
/
[Kolari 06b, Kolari 07] BlogPulse
HTML
ID
[ 10b]
HTML
3
1 1
[ 10] ID
ID
3
1 2
2
[ 10a, Katayama 11]
2
HTML ID
( 1 4
) Support Vector
Machines (SVMs) [Vapnik 98]
() [Gyöngyi 05, Kolari 06b, Macdonald 06, Kolari 07]
[Kolari 06b]
88%
75%
[Lin 07]
The 25th Annual Conference of the Japanese Society for Artificial Intelligence, 2011
1: 10 ID
ID
H
ID
S
F
72
56
SPaf (H, x)
x
1:
2.
1,101
953
2: ID
ID
(S F ) 10
ID
129 ID
ID 2,472
ID ID 121
ID 2,054
( ID 93.8%
83.3%)2 121
ID x H
ID x SPaf (H, x)
ID
ID
1
2 ID
ID ID ID
ID
[ 10] ASP(
)
ID 10 1
ID
KANSHIN [ 07]
RSS
Atom
S F 2
11 S 48,183
14,352 ( 30%) F 60,977
6,231 ( 10%) ID
ASP
ID
ID
ID
ID
ID
ID
[ 10]
ID
ID
[ 10]
ID
ASP10
ID 20,583
3.
SVM TinySVM
(http://chasen.org/~taku/software/TinySVM/)
[ 11]
2
2 4.
2
.
SVM
[Tong 00]
LBDs
LBDab
2 ID
ID
ID
ID
ID
ID
ID
1 Am At D Gl I Lk R St Tr
V
Figure 3: (S), panels (a) and (b)
Figure 4: (F), panels (a) and (b)
4.
ID
ID S 72
ID F 56 ID
ID
ID
500 () SVM
ID
ID
S 13 ID F
12 ID
ID
ID
500 ()
SVM
500
3
ID ( 1 2
3)
SVM
HTML ID
( 1 4)
4.1
2. S 48,183 F
60,977 10
ID
1 S 1,101 F 953
ID
ID
ID
ID
4.2
ID [ 10]
2. S 48,183
F 60,977
10
ID 10
ID HTML
(MinDF > 0.15)4 S 47,029 F
59,982
1 4
S 283 F 268
/
4.3
5.
HTML
ID
HTML
ID
SVM
LBDs LBDs
LBDs
S 3(a) F
4(a)
LBDab
LBDab
LBDab
S 3(b) F 4(b)
3 4
ID
ID
ID
ID
4.4
[ 07] , , ,
, 21
(2007)
[Glance 04] Glance, N., Hurst, M., and Tomokiyo, T.: BlogPulse:
Automated Trend Discovery for Weblogs, in Proc. Workshop on
the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2004)
[Gyöngyi 05] Gyöngyi, Z. and Garcia-Molina, H.: Web Spam Taxonomy, in Proc. 1st AIRWeb, pp. 39-47 (2005)
[ 08] , Letters,
Vol. 6, No. 4, pp. 37-40 (2008)
[ 10] , , , ID
, WebDB2010 (2010)
[ 10a] , , , , ,
HTML
, WebDB2010 (2010)
[ 10b] , , , , HTML
, 2 DEIM
(2010)
[Katayama 11] Katayama, T., Morijiri, A., Ishii, S., Utsuro, T.,
Kawada, Y., and Fukuhara, T.: Comparing Similarity of HTML
Structures and Affiliate IDs in Splog Analysis, in Xu, J., et al. eds.,
Proc. 16th DASFAA, Inter. Workshops: SNSMW, Vol. 6637 of
LNCS, pp. 378-389, Springer (2011)
[Kolari 06a] Kolari, P., Finin, T., and Joshi, A.: SVMs for the Blogosphere: Blog Identification and Splog Detection, in Proc. 2006
AAAI Spring Symp. Computational Approaches to Analyzing
Weblogs, pp. 92-99 (2006)
3 4
ID
ID
ID
3(a) 4(a)
S 35%F
60%90%
3(b)
4(b) S 17%
F 15%90%
4.2 1
4HTML
ID
HTML [ 10b]
[Kolari 06b] Kolari, P., Joshi, A., and Finin, T.: Characterizing the
Splogosphere, in Proc. 3rd Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2006)
[Kolari 07] Kolari, P., Finin, T., and Joshi, A.: Spam in Blogs and
Social Media, in Tutorial at ICWSM (2007)
[Lin 07] Lin, Y.-R., Sundaram, H., Chi, Y., Tatemura, J., and
Tseng, B. L.: Splog Detection using Self-similarity Analysis on
Blog Temporal Dynamics, in Proc. 3rd AIRWeb, pp. 1-8 (2007)
[Macdonald 06] Macdonald, C. and Ounis, I.: The TREC Blogs06
Collection : Creating and Analysing a Blog Test Collection, Technical Report TR-2006-224, University of Glasgow, Department of
Computing Science (2006)
[Mishne 05] Mishne, G., Carmel, D., and Lempel, R.: Blocking Blog
Spam with Language Model Disagreement, in Proc. 1st AIRWeb
(2005)
[ 11] , , , , ,
HTML
, 3 DEIM (2011)
[Tong 00] Tong, S. and Koller, D.: Support Vector Machine Active
Learning with Applications to Text Classification, in Proc. 17th
ICML, pp. 999-1006 (2000)
[Vapnik 98] Vapnik, V. N.: Statistical Learning Theory, Wiley-Interscience (1998)
4 MinDF [ 10b]