Takehito Utsuro
Takayuki Yoshinaka
Yasuhide Kawada
Tomohiro Fukuhara
HTML
HTML DOM
SVM
: HTML SVM
Best
Blogwise7
() [2, 3, 4, 5, 6]
Technorati1
Globe of Blogs5
Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, 305-8573, Japan
Graduate School of Engineering, Tokyo Denki University, Tokyo, 101-8457, Japan
Navix Co., Ltd., 8-3-6 Nishi-Gotanda, Shinagawa-ku, Tokyo 141-0031, Japan
Research into Artifacts, Center for Engineering, University of Tokyo, Kashiwa, Chiba 277-8568, Japan
1 http://technorati.com/
2 http://www.blogpulse.com/
3 http://kizasi.jp/
4 http://blogwatcher.pi.titech.ac.jp/
5 http://www.globeofblogs.com/
[4] 88%
75%[3, 7]
[5]
TREC8 Blog06
/
6 http://www.misohoni.com/bba/
7 http://www.blogwise.com/
8 http://trec.nist.gov/
1: /
(a) /
            198     293     277     768
            210     259    2849    3318
Total       408     552    3126    4086
(the last column is the row total)
(b)
ID
140
26
31
33
[4, 6]
BlogPulse
[8, 9, 10, 4]
HTML
ID
(SVM)
1 C
SVM
ID=1 S
ID= 2, 3, 4
SVM
[13, 14] 2
[12]
HTML
HTML SVM
HTML
3.1
2007 9 2008 2
/
HTML DOM
[15]
HTML DOM
[13, 14] /
1 HTML s
HTML HTML
P DIV
[13, 14]
ID
P DIV
P DIV [15]
BODY P DIV
DOM
BODY HTML
[15]
SCRIPT STYLE
ID (ID=1)
S T S
HTML
(ID=1, C)
ID=1
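The surviving text of this subsection mentions the BODY, P, DIV, SCRIPT and STYLE elements in connection with the HTML DOM structure. A minimal sketch of one way to reduce a page to a sequence of block-level tags follows. The choice to keep only P and DIV start tags under BODY, the decision to skip everything inside SCRIPT and STYLE, and the names `BlockTagCollector` and `block_sequence` are assumptions made for illustration, not the procedure of the paper.

```python
from html.parser import HTMLParser

# Assumption: content of these elements is ignored entirely.
SKIPPED = {"script", "style"}
# Assumption: only these block-level tags are kept as DOM structure.
BLOCKS = {"p", "div"}


class BlockTagCollector(HTMLParser):
    """Collects the sequence of P/DIV start tags found inside BODY."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "body":
            self.in_body = True
        elif tag in SKIPPED:
            self.skip_depth += 1
        elif self.in_body and self.skip_depth == 0 and tag in BLOCKS:
            self.sequence.append(tag)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1


def block_sequence(html):
    """Return the list of P/DIV tags under BODY, in document order."""
    parser = BlockTagCollector()
    parser.feed(html)
    parser.close()
    return parser.sequence
```

For example, `block_sequence("<body><div><p>a</p></div></body>")` yields the two-element sequence `["div", "p"]`, which could then serve as input to a structural comparison.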
3.2
S C
T S s
DOM
HTML s t
DOM dm(s) dm(t) DP
DP
C S
1 2 DP
2(a)
C S
st DOM
Rdiff(s, t)
2(b)(c)(d)
S
( 2(b)(c)(d)
)
2(a)(d)
1 HTML DOM
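This subsection compares the DOM structures dm(s) and dm(t) of two HTML files by DP and refers to a difference rate Rdiff(s, t). The exact definition of Rdiff did not survive in the text, so the following is only a sketch under stated assumptions: each DOM structure is represented as a flat sequence of tag names, the DP is a standard Levenshtein edit distance, and the rate is that distance divided by the length of the longer sequence. The function names `edit_distance` and `diff_rate` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed by DP."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def diff_rate(dm_s, dm_t):
    """Assumed form of Rdiff: edit distance normalised by the longer sequence."""
    longest = max(len(dm_s), len(dm_t))
    if longest == 0:
        return 0.0  # two empty structures are treated as identical
    return edit_distance(dm_s, dm_t) / longest
```

Under this assumed normalisation the rate lies between 0 (identical structures) and 1 (nothing in common), so a fixed threshold can be applied regardless of page size.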
3.3
DOM
HTML
2(a)
S T HTML s S t T
DOM
ID=2
DOM
HTML s S
HTML T t T
Rdiff(s, t) 10
HTML DOM
T Rdiff(s, t T)
10 t
Rdiff(s, t)
(ID=1, C)
(ID=2, 3, 4, S)
9 0
0.15
ID=1
(a) (ID=1, C)
(b) (ID=2, S)
(c) (ID=3, S)
(d) (ID=4, S)
2: DOM
4.2.1
SVM
HTML URL
4.1
/ URL
DOM
URL
i) HTML
3 HTML
URL
DOM
ii) HTML
s 3.3
2 URL
URL u
URL
log
u
1. T
DOM ()
4.2.2
2. T
DOM (
4.2
[16, 17]
/
10
t
t URL
w .
log
freq( , w) = a
freq( , w ) = b
freq( , w) = c
freq( , w) = d
w
Ancf B(w, s) Ancf B(w, t)
Ancf W(w, s) 2
2 (, w)
URL
$\dfrac{(ad - bc)^{2}}{(a + b)(a + c)(b + d)(c + d)}$
w
URL
log
2 (, w)
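The statistic in this subsection is built from the four frequencies a, b, c and d and has the form (ad - bc)^2 / ((a + b)(a + c)(b + d)(c + d)). A minimal sketch of computing it for one word w follows. The function name `association` and the convention of returning 0.0 when the denominator is zero are assumptions added for illustration.

```python
def association(a, b, c, d):
    """(ad - bc)^2 / ((a + b)(a + c)(b + d)(c + d)) for a 2x2 frequency table."""
    numerator = (a * d - b * c) ** 2
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        # Assumption: an empty row or column carries no evidence of association.
        return 0.0
    return numerator / denominator
```

With this form the value is 1.0 when w occurs in exactly one of the two classes and nowhere else (for instance a = d = 10, b = c = 0) and 0.0 when w is distributed identically across both classes.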
4.2.3
URL
log
URL
/ URL
URL ()
w s
5.1
s w
SVM
(http://chasen.org/~taku/software/TinySVM/)
Ancf B(w, s) =
URL
SVM TinySVM
s w
Ancf W(w, s) Ancf W(w, t)
2 6
2
Ancf W(w, s) =
URL
Ancf B(w, s) 2
5.2
SVM
[12]11
LBDp
LBDn
URL
w
11 (
[19, 12, 20] )
URL
10
(http://chasen-legacy.sourceforge.jp/) ipadic
(a-1) (C)
(a-2) (C)
(b-1) (S)
(b-2) (S)
3: /
(a-1) (C)
(a-2) (C)
(b-1) (S)
(b-2) (S)
4: /
6.1
2
3 4 (a-1)(a-2)
DOM ()
1(a) C
URL DOM ()
4 (b-1)(b-2)
2 DOM
( 3)
/
() URL
( 4)
URL
/C
DOM ()
408 S
552
3 (a-1) (b-1)
(a-2) (b-2)
140
DOM ()
140 S
(a-2)
90 90
DOM ()
10
6.2
DOM ()
LBDp
DOM ()
LBDp
LBDp
3 4
2 DOM
()+DOM ()
LBDn
DOM ()
LBDn
LBDn
DOM
6.3
() DOM (
3 4
+DOM ()
DOM ()
+DOM
HTML
() DOM
SVM
HTML
SVM
[13] , , , , ,
, .
. DEWS2008
, 2008.
HTML
[14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Analysing features of Japanese splogs and characteristics of keywords. In Proc. 4th AIRWeb, pp. 33-40, 2008.
DOM
[15] , .
.
, Vol. 8, No. 1, pp. 29-34, 2009.
DOM
[16] , , , , ,
.
.
DEIM , 2009.
[2] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pp. 39-47, 2005.
[3] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.