
DEWS2008 A10-2


Collecting/Classifying Splogs and Developing Splog Data Set based on Time Series Characteristics of Keywords
Yuuki SATO, Takehito UTSURO, Tomohiro FUKUHARA, Yasuhide KAWADA,
Yoshiaki MURAKAMI, Hiroshi NAKAGAWA, and Noriko KANDO
Grad. Sch. Systems and Information Engineering, University of Tsukuba, Tsukuba, 305-8573, Japan
Research into Artifacts, Center for Engineering, University of Tokyo, Kashiwa, Chiba 277-8568, Japan
Navix Co., Ltd., 8-3-6 Nishi-Gotanda, Shinagawa-ku, Tokyo 141-0031, Japan
Information Technology Center, University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo 113-0033, Japan
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Abstract This paper analyzes (Japanese) splogs based on various characteristics of the keywords they contain. By analyzing these keyword characteristics, we estimate how spammers behave when creating splogs from other sources. Since splogs often introduce noise into word occurrence statistics in the
blogosphere, we assume that splogs can be collected efficiently by sampling blog homepages that contain keywords of a
certain type on the date of their most frequent occurrence. We manually examine various features of the collected blog
homepages: whether their text content is excerpted from other sources, and whether they display
affiliate advertisements or out-going links to affiliated sites. Among the various informative results of our analysis, it is
important to note that more than half of the collected splogs are created by a very small number of spammers.
Key words blog, spam blog, affiliate
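The collection strategy described in the abstract (sample blog homepages that contain a keyword on the date the keyword occurs most often) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the post records, the field names `homepage`, `date`, and `text`, the function names, and the use of plain substring matching in place of proper keyword extraction are all assumptions made for the example.

```python
import random
from collections import Counter
from datetime import date


def peak_date(posts, keyword):
    """Return the date on which `keyword` appears in the most posts."""
    counts = Counter(p["date"] for p in posts if keyword in p["text"])
    if not counts:
        return None
    # Ties are broken by the earliest date so the result is deterministic.
    return min(counts, key=lambda d: (-counts[d], d))


def sample_homepages(posts, keyword, k, seed=0):
    """Sample up to k distinct homepages that used `keyword` on its peak date."""
    day = peak_date(posts, keyword)
    if day is None:
        return []
    homepages = sorted({p["homepage"] for p in posts
                        if p["date"] == day and keyword in p["text"]})
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(homepages, min(k, len(homepages)))


# Hypothetical toy data; real input would come from a blog crawl.
posts = [
    {"homepage": "http://a.example/", "date": date(2007, 3, 1), "text": "wii review"},
    {"homepage": "http://b.example/", "date": date(2007, 3, 2), "text": "wii sale"},
    {"homepage": "http://c.example/", "date": date(2007, 3, 2), "text": "cheap wii"},
    {"homepage": "http://b.example/", "date": date(2007, 3, 2), "text": "wii again"},
]

candidates = sample_homepages(posts, "wii", k=5)
```

The sampled homepages would then be examined manually, as the abstract describes; the sketch only covers the selection of candidates.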

1.

Technorati 1, BlogPulse 2, kizasi.jp 3

blogWatcher 4 [1]

2.

Globe of Blogs 5

Best Blogs in Asia Directory 6

Blogwise 7

[2][6]

[4] 88%
75%

i)

ii)

[3], [7]

[4][6]

[5] TREC

Blog06

[4], [6] BlogPulse

adsense 9

[4], [8], [9]

3.
1
3

i)
ii)

iii)

3. 1

[4], [6]

[10]

i)
1 http://technorati.com/
2 http://www.blogpulse.com/

ii)

3 http://kizasi.jp/

iii) 10

4 http://blogwatcher.pi.titech.ac.jp/

iv)

5 http://www.globeofblogs.com/
6 http://www.misohoni.com/bba/

9 http://google.com/adsense

7 http://www.blogwise.com/

10

8 http://trec.nist.gov/

(%)

80.5
31.0

8.1

42.1

14.3

70.8

27.1

2.9

[11]

3.6

12.7

[6]

SEO

11.5

[11]

4.5

49.5
36.9

v) [11]

3. 2

5. 1

2.

a)

i)
ii)
iii)

b)

i) ii)iii) 6.

iv)

4. 2

v) [11]

( i) iii) )

3. 3

i)

4.
4. 1

ii)

iii)

(a)

iv) [6]

(b)

(a)
1

(b)

2 (2007 12 3 0:00 )

3,591,306 192,699,276

1,355

196,975

4. 2

(3-a)

adsense

(3-b)

5. 2

[13]

2 50

RSS 12

Atom

50

Juman 13

50

11

5.
5. 1

2
(2007 3 ) 2004

360 1 9300

5. 3

5. 1

11
[12]

12 RDF Site Summary, Really Simple Syndication, Rich Site Summary


13 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html

3.

1 2 50
URL 2007

2 URL 50
60URL
110 URL 50URL 1 3
60URL 1
2
3 URL

a URL

i.
ii.

b URL .
5

6.
4. 2 50 4 22
22

6. 1
3 , 88% 3
2
50%

14

4 URL

14 [12]
Doorway Doorway

S C

J A L G Y

192

142

54

24

26

442

203

115

169

355

128

130

207

396

1703

395

257

223

379

131

131

207

422

2145

48.6

55.3

24.2

6.3

2.3

0.8

0.0

6.2

20.6

(%)

ID

( 1 )

115 (42.3%)

ZARD
Wii
2

56

(20.6%)

30

(11.0%)

()

26

(9.6%)

()

20

(7.4%)

(
)

10

(3.7%)

(2.5%)

(1.5%)

(0.7%)

10

(0.7%)

272

1
10%

442
2 10
442 272(61.5%)
10
10
3

6. 2

22 5

5 22

30%30 10%10% 3

2
4

(1) 30% 5 4

5 (
50%)

(%)

ID

(%)

(%)

89.2

92.4

2, 6, 8

38.5

88.1

94.8

27.8

58.1

90.2

3, 4

12.0

40.9

18.5

36.1

58.7

5, 7

19.8

37.4

24.4

14.3

1, 10

21.7

22.5

11.1

20.5

22.1

0.0

22.1

19.1

0.0

19.1

15.2

80.0

1, 6

3.4

15.1

0.0

15.1

14.3

14.3

12.2

6.9

71.4

1, 3

2.1

ZARD

4.7

20.0

3.8

4.7

20.0

3.8

2.9

100.0

0.0

Wii

2.8

66.7

1.0

2.8

33.3

1.9

2.0

0.0

2.0

1.8

50.0

0.9

0.0

0.0

0.0

0.0

0.0

0.0

20.5

61.5

1 - 10

9.0

(6)

(2) 4 10%

7.

30%

[4][6]

[7], [9]

(3) 3 (2)

(4) 10% 6

(5) 30% 5
4

[1] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs.
In WWW Alt. '04: Proceedings of the 13th international
World Wide Web conference on Alternate track papers &
posters, pp. 320-321. ACM Press, 2004.
[2] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In
AIRWeb '05: Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web, pp.
39-47, 2005.
[3] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.

[4] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proceedings of WWW 2006 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis
and Dynamics, 2006.
[5] C. Macdonald and I. Ounis. The TREC Blogs06 collection:
Creating and analysing a blog test collection. Technical
Report TR-2006-224, University of Glasgow, Department of
Computing Science, 2006.
[6] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social
media. In Tutorial at ICWSM, 2007.
[7] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L.
Tseng. Splog detection using self-similarity analysis on blog
temporal dynamics. In AIRWeb '07: Proceedings of the 3rd
International Workshop on Adversarial Information Retrieval on the Web, pp. 1-8, 2007.
[8] . .
Web (DBWeb2007)
. , 2007.
[9] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere:
Blog identification and Splog detection. In Proceedings of
the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pp. 92-99, 2006.
[10] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami,
H. Nakagawa, and N. Kando. Collecting and analyzing
Japanese splogs based on characteristics of keywords. In
Proceedings of ICWSM, 2008.
[11] Wikipedia, Word salad (computer science). http://en.wikipedia.org/wiki/Word_salad_%28computer_science%29.
[12] Y.M. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers. In Proceedings of the 16th WWW Conference, pp. 291-300, 2007.
[13] , , .
.

13 Web
, pp. 40-43, 2007.
