
HTML

Splog Detection using Similarities of HTML Structures



Taichi Katayama


Takehito Utsuro


Takayuki Yoshinaka


Yasuhide Kawada


Tomohiro Fukuhara
: (
)
HTML
HTML DOM

SVM

: HTML SVM

Best

Blogs in Asia Directory

Blogwise7

() [2, 3, 4, 5, 6]

Technorati1

BlogPulse2 kizasi.jp3 blogWatcher 4 [1]

Globe of Blogs5

Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, 305-8573, Japan
Graduate School of Engineering, Tokyo Denki University, Tokyo, 101-8457, Japan
Navix Co., Ltd., 8-3-6 Nishi-Gotanda, Shinagawa-ku, Tokyo 141-0031, Japan
Research into Artifacts, Center for Engineering, University of Tokyo, Kashiwa, Chiba 277-8568, Japan
1 http://technorati.com/
2 http://www.blogpulse.com/
3 http://kizasi.jp/
4 http://blogwatcher.pi.titech.ac.jp/
5 http://www.globeofblogs.com/

[4] 88%

75%[3, 7]

[5]
TREC8 Blog06
/
6 http://www.misohoni.com/bba/
7 http://www.blogwise.com/
8 http://trec.nist.gov/

1: /
(a) /

198   293    277    768
210   259   2849   3318
408   552   3126   4086

(b)

ID

140

26

31

33

[4, 6]

BlogPulse
[8, 9, 10, 4]

1: HTML DOM DOM

HTML

Support Vector Machines [11]

ID

(SVM)

1 C

SVM

ID=1 S
ID= 2, 3, 4

SVM

[13, 14] 2

[12]

HTML

HTML SVM

HTML

3.1

2007 9 2008 2
/

HTML DOM

[15]

HTML DOM

[13, 14] /

1 HTML s

HTML HTML

P DIV

[13, 14]

ID

P DIV

P DIV [15]

BODY P DIV

DOM

BODY HTML

[15]

SCRIPT STYLE

ID (ID=1)

S T S

s AvMinDF10(s, T)

HTML

(ID=1, C )

ID=1

HTML s DOM dm(s)

3.2

S C
T S s

DOM

AvMinDF10(s, T)

HTML s t
DOM dm(s) dm(t) DP

DP

C S

1 2 DP

2(a)

edit distance (dm(s), dm(t))

C S

st DOM

T S s AvMinDF10(s, T)

Rdiff(s, t)

2(b)(c)(d)
S

\[
R_{\mathit{diff}}(s, t) = \frac{\mathrm{edit\_distance}(dm(s),\, dm(t))}{|dm(s)| + |dm(t)|}
\]
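The normalized difference Rdiff(s, t) between two DOM tag sequences can be illustrated with a small dynamic-programming sketch. This is only an illustration under stated assumptions: the function names `edit_distance` and `r_diff` are hypothetical, each dm(·) is represented as a flat list of tag names, and insertion, deletion, and substitution all cost 1, which the surviving text does not confirm.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two tag sequences, computed by DP.

    Assumption: unit cost for insertion, deletion, and substitution.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def r_diff(dm_s: Sequence[str], dm_t: Sequence[str]) -> float:
    """Edit distance divided by the summed lengths of the two sequences."""
    total = len(dm_s) + len(dm_t)
    if total == 0:
        return 0.0  # two empty sequences are treated as identical
    return edit_distance(dm_s, dm_t) / total


if __name__ == "__main__":
    s = ["BODY", "DIV", "P"]
    t = ["BODY", "P"]
    print(r_diff(s, t))
```

Identical sequences give 0, and the value grows as more tags must be inserted, deleted, or replaced to turn one sequence into the other.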

( 2(b)(c)(d)
)
2(a)(d)

1 HTML DOM

3.3

DOM

HTML

2(a)

S T HTML s S t T

AvMinDF10(s, T) 0 0.15

DOM

ID=2

AvMinDF10(s, T) 0.5

DOM

HTML s S

HTML T t T
Rdiff(s, t) 10

AvMinDF10(s, T)


ID

AvMinDF10(s, T)

AvMinDF10(s, T)

HTML DOM

T Rdiff(s, t ∈ T)

10 t

Rdiff(s, t)
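The surviving fragments suggest that AvMinDF10(s, T) averages Rdiff(s, t) over the 10 pages t in T that are closest to s. The sketch below follows that reading, which is an assumption because the defining sentence is lost. The function name `av_min_df` and the parameter `k` are hypothetical, and the input is taken to be the already computed Rdiff(s, t) values for all t in T.

```python
import heapq
from typing import Iterable


def av_min_df(diffs: Iterable[float], k: int = 10) -> float:
    """Average of the k smallest difference values.

    Assumption: if fewer than k values are available, all of them are averaged.
    """
    smallest = heapq.nsmallest(k, diffs)
    if not smallest:
        raise ValueError("at least one difference value is required")
    return sum(smallest) / len(smallest)


if __name__ == "__main__":
    # Hypothetical Rdiff(s, t) values of one page s against a set T
    diffs = [0.02, 0.40, 0.05, 0.31, 0.03, 0.45, 0.04, 0.38,
             0.06, 0.01, 0.07, 0.09, 0.08, 0.50]
    print(av_min_df(diffs))
```

Under this reading, a small value means that s has at least 10 structurally near-identical neighbours in T.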

(ID=1, C)
(ID=2, 3, 4, S)

9 0
0.15
ID=1

(a) (ID=1, C)

(b) (ID=2, S)

(c) (ID=3, S)

(d) (ID=4, S)

2: DOM

4.2.1

SVM

HTML URL

4.1

/ URL

DOM

URL

i) HTML

3 HTML

URL

DOM

ii) HTML

s 3.3

2 URL

AvMinDF10(s, T)


log AvMinDF10(s, T)

URL u

URL

log
u

1. T
DOM ()

4.2.2

2. T

[18, 13, 14]

DOM (

4.2

[16, 17]

/
10

t
t URL

w .

log

f req(,

f req( ,

w)=a

w ) = b

f req(
, w) = c

f req(
, w) = d


w


AncfB(w, s) AncfB(w, t)


AncfW(w, s) 2

2 (, w)

URL

\[
\frac{(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)}
\]
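The statistic above is computed from the four co-occurrence counts a, b, c, d of a 2x2 contingency table. The sketch below evaluates it directly. The name `phi_squared` is hypothetical (the statistic's symbol is lost in the text); the form without a sample-size factor matches the φ² coefficient, that is, χ² divided by the total count. Returning 0.0 for an empty row or column is an assumption made here to avoid division by zero.

```python
def phi_squared(a: int, b: int, c: int, d: int) -> float:
    """(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) for a 2x2 contingency table."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # assumption: a degenerate table carries no association
    return (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Hypothetical counts: w occurs in 30 of 40 pages of one class
    # and in 5 of 60 pages of the other class.
    print(phi_squared(30, 10, 5, 55))
```

The value is 0 when w is distributed independently of the class and 1 when w occurs in exactly one of the two classes.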

w
URL

log


2 (, w)

4.2.3

URL



log

URL

/ URL

URL ()

w s

5.1

AncfB(w, s) AncfW(w, s)

s w

SVM

(http://chasen.org/~taku/software/TinySVM/)

AncfB(w, s) =

URL

SVM TinySVM

s w


AncfW(w, s) AncfW(w, t)

2 6
2

AncfW(w, s) =

URL

AncfB(w, s) 2

5.2

SVM
[12]11

LBDp
LBDn

URL
w
11 ([19, 12, 20])

URL
10
(http://chasen-legacy.sourceforge.jp/) ipadic

(a-1) (C )

(a-2) (C )

(b-1) (S )

(b-2) (S )

3: /

(a-1) (C )

(a-2) (C )

(b-1) (S )

(b-2) (S )

4: /

6.1

2
3 4 (a-1)(a-2)

DOM ()

1(a) C

URL DOM ()

4 (b-1)(b-2)
2 DOM

( 3)
/

() URL

( 4)

URL

/C

DOM ()

408 S

552

3 (a-1) (b-1)

(a-2) (b-2)

140

DOM ()

140 S

(a-2)

90 90

DOM ()

10

6.2

DOM ()

LBDp

DOM ()

LBDp

LBDp

3 4
2 DOM

()+DOM ()

LBDn

DOM ()

LBDn

LBDn

DOM

6.3

() DOM (

3 4

+DOM ()

DOM ()

+DOM

HTML

() DOM

SVM

HTML

[12] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proc. 17th ICML, pp. 999-1006, 2000.

SVM

[13] , , , , ,
, .
. DEWS2008
, 2008.

HTML

[14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Analysing features of Japanese splogs and characteristics of keywords. In Proc. 4th AIRWeb, pp. 33-40, 2008.

DOM

[15] , .
.
, Vol. 8, No. 1, pp. 29-34, 2009.

DOM

[16] , , , , ,
.
.
DEIM , 2009.

[17] T. Katayama, Y. Sato, T. Utsuro, T. Yoshinaka, Y. Kawada, and T. Fukuhara. An empirical study on selective sampling in active learning for splog detection. In Proc. 5th AIRWeb, pp. 29-36, April 2009.

[1] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. '04: Proc. 13th WWW Conf. Alternate Track Papers & Posters, pp. 320-321, 2004.

[18] Y.M. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers. In Proc. 16th WWW, pp. 291-300, 2007.
[19] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In Proc. 17th SIGIR, pp. 3-12, 1994.

[2] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pp. 39-47, 2005.
[3] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.
[20] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proc. 17th ICML, pp. 839-846, 2000.

[4] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proc. 3rd Ann. Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.
[5] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006.
[6] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007.
[7] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pp. 1-8, 2007.
[8] . .
Letters, Vol. 6, No. 4, pp. 37-40, 2008.
[9] .
. Web
(WebDB Forum)2008 .
, 2008.
[10] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proc. 2006 AAAI Spring Symp. Computational Approaches to Analyzing Weblogs, pp. 92-99, 2006.
[11] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
