
The Analysis of Youths’ Searching Behavior

Chao Li, Bin Wu, Yuxiao Li


Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing, China
Internet Management and Law Research Center, Beijing University of Posts and Telecommunications, Beijing, China

Abstract- To create a safe and harmonious cyberspace environment for youth and protect them from bad information such as pornography, violence and gambling, most current solutions shield the sites that contain this information. However, some youth take the initiative to seek out this bad information. In this paper we propose a method to analyze the searching behavior of youth and then shield the keywords that may harm their health over time. We suggest that search engine operators provide keyword filtering technology to limit the searching behavior of youth. Operators should return a warning to users who search for keywords that contain bad information and record the search log as feedback to parents. Parents should first provide their IP addresses to the search engine operators so that operators can place those IP addresses under supervision. Relying on the search logs, we count and classify keywords to understand the trend of youths’ searching behavior and the areas they are often interested in. Then we cluster the keywords in each area to obtain the main clusters and frequent keywords. Using this method we can grasp youths’ searching behavior and effectively protect them from the impact of bad information.

Keywords- classify; cluster; searching behavior; bad information; shield

I. INTRODUCTION

In the internet era, cyber security is becoming an increasingly important issue. According to the 26th China Internet Development Statistics Report published by the China Internet Network Information Center (CNNIC), the number of China's internet users reached 457 million as of the end of December 2010 (73.3 million more than in 2009), and internet penetration rose to 34.3%. People aged 10-19 account for 27.3% of users, second only to those aged 20-29.[1] Youth is a social group with a lower tolerance. Their physical and mental development is immature, so they are harmed more than adults when they are exposed to bad information. In this paper we define bad information as information about pornography, violence and gambling. From a traditional point of view this information affects youths’ health because it is contrary to traditional Chinese ethical and moral values. Youths’ ability to protect themselves and resist temptation is poor, and they may take the initiative to seek out pornographic, violent and gambling information. There is still no good solution for supervising the search behavior of youth. This paper proposes an approach that supervises youths’ search records from the point of view of their searching behavior. When there is a bad search request, the search engine does not return the list of uniform resource locator (URL) results; instead it issues a warning and records the search so that youths’ searching habits can be studied, in order to protect their health.

II. RELATED TECHNOLOGY

A. Text Classification and Text Similarity

Text classification is the process of assigning the documents to be processed to the categories of a given classification system so that each document is placed in the correct category. It is a supervised continuous learning process. It uses a labeled corpus to compute similarities with unlabeled documents and so find the most likely categories for the unlabeled documents.[2] Commonly used text classification methods include Naive Bayes, Support Vector Machine (SVM), maximum entropy, k-nearest neighbors (KNN), AdaBoost and so on. The AdaBoost algorithm is one of the most practical methods of machine learning. It is an iterative learning method that can turn a weak learning algorithm into a strong learning algorithm. Even if a single classifier has a low recognition rate, the combination of multiple classifiers can still achieve a higher recognition rate.

Text similarity is an important and basic issue. It is widely used in many fields such as text mining, information retrieval and machine translation. There are many models for computing text similarity. The Vector Space Model (VSM), proposed by Salton, is a classic one. In this model a document is represented by a vector whose components are feature items. Each component has a weight computed by a statistical method, and the similarity between documents is determined by, for example, the cosine of the angle between their vectors. The key point of this method is how to obtain the feature items of a document so as to create an appropriate vector to represent it.

B. Text Clustering

The process of dividing a dataset into several different clusters according to the features of the data is called text clustering. Text clustering is an unsupervised process and has become one of the most important branches of data mining. There are many methods for clustering texts, such as partition-based methods, hierarchical methods, density-based spatial clustering of applications with noise (DBSCAN) and so on. K-means is a classic partition-based text clustering method. The basic principle of K-means is as follows: Firstly, we choose k documents as the initial k clusters. Secondly, according to the mean of the

978-0-615-51608-0/11 ©2011 EWI


Authorized licensed use limited to: JESSY J D. Downloaded on February 07,2024 at 17:09:57 UTC from IEEE Xplore. Restrictions apply.
objects in the clusters, we assign each remaining document to the cluster that is most similar to it. Then we recalculate the mean of each cluster. Finally we repeat these steps until the members of each cluster no longer change.

III. FLOW OF SOLUTION

There are three steps in the process of obtaining and analyzing the queries recorded by search engine operators. Each query contains at least a time, an IP address, a keyword and a URL. The first two steps prepare the input datasets for the third step. The third step analyzes the datasets and produces the final result.

A. Get IPs

Parents should provide their IP addresses to the search engine operators if they want to supervise their children’s searching behavior.

B. Supervise IPs and Save Queries

Operators should record all queries from these IPs. For bad queries, operators should return warnings instead of a list of URL results.

C. Analyze Queries

There are three steps in analyzing queries. Firstly we segment the keyword of each query. Secondly we classify these keywords into several categories, which are defined before classification. Finally, within each category we cluster its keywords to get several clusters. Each cluster represents an aspect that concerns youth.

D. Classify keywords

In this paper we choose Naive Bayes as the basic classifier. We use the AdaBoost algorithm to combine multiple basic classifiers with different weights into a stronger classifier. The AdaBoost algorithm is described below.

Assume that X represents the sample space, Y represents the category set of the samples and S represents the training set. Here we limit Y to be binary:

$$Y = \{-1, +1\}$$

$$S = \{(x_i, y_i) \mid i = 1, 2, \ldots, m\}, \quad x_i \in X, \; y_i \in Y$$

1) Initialize the weights of the m samples by taking the initial distribution to be uniform:

$$D_1(i) = 1/m$$

where $D_t(i)$ represents the weight of sample $(x_i, y_i)$ in the t-th iteration. Let T represent the number of iterations.

2) For t = 1 to T do

According to the distribution $D_t$, sample the training set S with replacement to generate the training set $S_t$. Then train the classifier $h_t$ on the training dataset $S_t$. Use $h_t$ to classify all the samples in the training set S to get the current classifier (which has a certain degree of error):

$$h_t : X \to Y$$

$$\varepsilon_t = \Pr_{i \sim D_t}\left[h_t(x_i) \neq y_i\right]$$

Let $\alpha_t = \frac{1}{2}\ln\left[(1 - \varepsilon_t)/\varepsilon_t\right]$ and update the weight of each sample:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t}, & h_t(x_i) = y_i \\ e^{\alpha_t}, & h_t(x_i) \neq y_i \end{cases}$$

where $Z_t$ is a normalization factor that ensures $\sum_i D_{t+1}(i) = 1$.

End for.

3) The final output is:

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$

All categories are defined below. Taking into account the reasons for youth searching behavior, we have defined eleven categories. The classifier assigns a category to the keywords of each query.

TABLE I. DEFINITION OF CATEGORIES

Category ID   Category
1             pornography
2             violence
3             gambling
4             crime
5             science
6             game
7             film
8             entertainment
9             military
10            sport
11            other

E. Cluster keywords

After classifying keywords we cluster them within each category to find clusters. Because the keywords of queries are short texts whose features are not obvious, we cluster the keywords twice.

The first clustering (rough): We use the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm[3] to cluster keywords and obtain clusters, each of which is called a barrel. Each barrel has a label, which is the most frequent word that appears in it. Then we merge similar barrels into one barrel and update the label. Assuming that there are two barrels A and B, then the similarity

between A and B is computed by the proportion in which label A occurs in B and the proportion in which label B occurs in A.

The second clustering (refine): For each barrel we use the GN algorithm to find clusters.

The GN algorithm is described below[4]:

(1) Compute the betweenness of each edge in the network. Each query represents a node.
(2) Find the edge that has the highest betweenness and remove it from the network.
(3) Repeat step (1) and step (2) until each node forms its own degenerate community.

IV. EXPERIMENTS

In this paper we use 7.98 million queries (from August 1, 2006 to August 15, 2006) from Sogou Lab as the input dataset to test the algorithms above. The performance of the algorithms is shown below:

TABLE II. PERFORMANCE OF ALGORITHMS

Algorithm   Precision (%)   Recall (%)
classify    80              90
cluster     90              85

After classifying the dataset we obtain the ten most frequent queries. These queries represent the issues that netizens were often concerned about within a certain period. Here we define pornography, violence and gambling by traditional Chinese standards.

TABLE III. TOP 10 QUERIES

No.   Name                                       Count    Category
1     陋俗 (vulgar custom)                        117048   pornography
2     女艺人 (actress)                            30837    entertainment
3     百度 (Baidu)                                29636    science
4     扫黄图片 (pictures on anti-pornography)     13952    pornography
5     留学 (study abroad)                         11566    science
6     游戏外挂 (online game plugs)                10743    entertainment
7     周公解梦 (Zhougong dream explanation)       9867     entertainment
8     小游戏 (mini-game)                          8715     game
9     魔兽世界 (World of Warcraft)                8591     game
10    免费电影 (free movie)                       8564     film

From TABLE III we can see the keywords that netizens search for most frequently. Take the No. 1 category (pornography) as an example: we cluster all the queries that are classified in the category pornography (accuracy 80%). Here we define pornography according to traditional Chinese values. The result is shown below:

TABLE IV. TOP 10 QUERIES (ABOUT PORNOGRAPHY)

No.   Name                                       Count
1     黄色网站 (pornographic websites)            11963
2     色情图片 (pornographic pictures)            8884
3     夜总会丑陋一幕 (an ugly scene in nightclub)  5447
4     极品公子 (rare prince)                      5316
5     《蜜桃成熟时》 (When the Peach Is Ripe)      4999
6     女孩无耻一幕 (shameless scene of girls)     4807
7     龙戏九凤 (one dragon VS nine phoenixes)     2787
8     AV 视频 (adult video)                       2405
9     情色网站 (erotic websites)                  2293
10    成人小说 (adult novel)                      1808

From TABLE IV we can easily see which pornography queries netizens search for most frequently.

The experiments demonstrate that the method proposed in this paper can classify and cluster queries effectively. The method can also be applied to the queries of a particular field such as youths’ searching behavior. Using it, we can analyze youths’ searching behavior and grasp their searching habits effectively.

V. CONCLUSION

This paper proposed a solution for supervising youths’ searching behavior in order to protect them from the impact of unhealthy information. The first two steps of the solution require the agreement and mutual trust of parents and search engine operators. The third step depends on the first two, which prepare the necessary datasets for it. After the third step is executed we obtain a reasonable picture of youths’ searching behavior and a general understanding of their areas of concern.

ACKNOWLEDGMENT

This material is based upon work supported by the National Natural Science Foundation of China (Grant No. 60905025, 90924029 and 61074128).

REFERENCES

[1] China Internet Network Information Center. China Internet Development Statistics Report [R]. Beijing, 2011.01.
[2] Wu Bin, Zhang Zhonghui, Li Chao, Zhu Tian and Wang Weiduo. Construct and Exploit the Event and Theme Network for Emergency Management. ISEM’09, 227-232 (Ch).
[3] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. 2007.
[4] Xiaofan Wang, Xiang Li, Guanrong Chen. Complex Network Theory and Its Applications. 2005.
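
As a supplementary illustration of the AdaBoost weight update described in Section III.D, the following minimal sketch runs a single boosting round on a toy sample. The function name `adaboost_round` and the toy labels are hypothetical and do not come from the paper; the base classifier's predictions are supplied directly rather than produced by a trained Naive Bayes model, and the resampling step is omitted.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """Run one AdaBoost round and return (epsilon_t, alpha_t, D_{t+1})."""
    # epsilon_t: total weight of the misclassified samples under D_t
    eps = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    if not 0.0 < eps < 1.0:
        raise ValueError("epsilon_t must lie strictly between 0 and 1")
    # alpha_t = (1/2) ln[(1 - epsilon_t) / epsilon_t]
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Scale correct samples by e^{-alpha_t} and misclassified ones by e^{alpha_t}
    unnormalized = [
        w * math.exp(-alpha if yt == yp else alpha)
        for w, yt, yp in zip(weights, y_true, y_pred)
    ]
    z = sum(unnormalized)  # Z_t, so that the new weights sum to 1
    return eps, alpha, [u / z for u in unnormalized]

# Toy data (assumed for illustration): four samples, one misclassified.
m = 4
d1 = [1.0 / m] * m               # D_1(i) = 1/m
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]    # the sample at index 1 is wrong
eps, alpha, d2 = adaboost_round(d1, labels, predictions)
```

In this toy run the misclassified sample ends up carrying half of the total weight, so the next base classifier would concentrate on it.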

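
As a supplementary illustration of the Vector Space Model mentioned in Section II.A, the sketch below computes the cosine similarity of two already segmented queries. It assumes raw term frequency as the component weight, which is only one possible statistical weighting and is not necessarily the one used in the paper; the function name `cosine_similarity` and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the feature items that both documents share
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

# Hypothetical segmented queries
same = cosine_similarity(["free", "movie"], ["free", "movie"])
partial = cosine_similarity(["free", "movie"], ["free", "game"])
disjoint = cosine_similarity(["free", "movie"], ["mini", "game"])
```

Queries that share no feature items score 0, identical queries score 1, and partial overlap falls in between, which is the ordering a clustering step relies on.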