

SimRank: A Page Rank Approach Based on Similarity Measure∗
Shaojie Qiao†, Tianrui Li, Hong Li and Yan Zhu
School of Information Science and Technology
Southwest Jiaotong University
Chengdu 610031, China
Email: sjqiao@swjtu.edu.cn

Jing Peng†
Science and Technology Department
Chengdu Municipal Public Security Bureau
Chengdu 610017, China

Jiangtao Qiu
School of Information
Southwestern University of Finance and Economics
Chengdu 610074, China

Abstract—As the Web contains rich and convenient information, Web search engines are increasingly becoming the dominant information retrieval approach. In order to rank the query results of web pages in an effective and efficient fashion, we propose a new page rank algorithm based on a similarity measure from the vector space model, called SimRank, to score web pages. Firstly, we propose a new similarity measure to compute the similarity of pages and apply it to partition a web database into several web social networks (WSNs). Secondly, we improve the traditional PageRank algorithm by taking into account the relevance of a page to a given query. Thirdly, we design an efficient web crawler to download the web data. Finally, we perform experimental studies to evaluate the time efficiency and scoring accuracy of SimRank against other approaches.

Index Terms—information retrieval; page rank; similarity measure; web social network.

I. INTRODUCTION

As Web applications become prevalent and ubiquitous, the growing pages form distinct web social networks (WSNs for short), which are becoming intricate. How to effectively and efficiently discover valuable information from the vast ocean of WSNs is emerging as a challenging research topic. In particular, ranking (scoring) web pages is very difficult for the following reasons: (1) pages grow by terabytes or even petabytes each day, so a large amount of time is needed to process them, and (2) the link relations among these pages are very intricate, which can affect the accuracy of page scoring. HITS [1] and PageRank [2] are two widely used page ranking approaches. Subsequently, researchers proposed a series of improved PageRank algorithms, owing to PageRank's advantages over other page scoring approaches: it takes into account the link structure of distinct pages and has the capability of resisting mendacious links.

As for the PageRank algorithm, suppose there is a web page A that has an out-link to page B. PageRank treats this link as A's vote for B and employs such votes to estimate the importance of B: the more popular a page is, the more likely it is to be voted for by other pages. PageRank has the following drawbacks [3].

1) The PageRank algorithm works in an offline fashion, so the scores of pages depend on the previously crawled pages. Actually, the Web accumulates a large volume of new pages containing high-quality information within a short period of time. These newly published pages with high authority will be frequently cited by other pages or web sites, which will increase their page rank values.
2) Traditional PageRank algorithms are biased toward URLs ending with ".com", since such pages are often portals or large sites that can easily attract more links or citations than some professional ones. However, those pages which provide special services for certain users are sometimes authoritative and have lots of in-links from other pages.
3) PageRank cannot identify whether the hyperlinks of distinct pages are content-correlated (i.e., whether the themes are similar among pages) or not, which causes the phenomenon of "theme drift".

This paper mainly aims to find an appropriate method to analyze the structure of WSNs that contain a large number of complex link relations among pages. To achieve this goal, we make the following contributions in this study.

1) Since the traditional PageRank algorithm may cause the phenomenon of "theme drift", we use the similarity measure derived from the vector space model [4] to compute the similarity between pages based on terms, and apply it to partition the Web into distinct WSNs.
2) We propose a weighted page rank algorithm, namely SimRank, which considers the relevance of a page to the given query and thereby improves the accuracy of scoring.
3) We develop an efficient web crawler to obtain data from the Web, which can also filter out useless pages.
4) We conduct experiments to estimate the accuracy and efficiency of the SimRank algorithm and the clustering approach applied in SimRank.

∗ This work is partially supported by the National Science Foundation for Post-doctoral Scientists of China under Grant No. 20090461346; the Fundamental Research Funds for the Central Universities under Grant No. SWJTU09CX035; the Sichuan Youth Science and Technology Foundation of China under Grant No. 08ZQ026-016; Sichuan Science and Technology Support Projects under Grant No. 2010GZ0123; Innovative Application Projects of the Ministry of Public Security under Grant No. 2009YYCXSCST083; the Education Ministry Youth Fund of Humanities and Social Science of China under Grant No. 09YJCZH101; and Major Special Science and Technology Projects-Significant Creation of New Drugs under Grant No. 2009ZX09313-024.
† Corresponding author. Email: sjqiao@swjtu.edu.cn, pj@tfol.com


II. RELATED WORK

With the rapid growth and spread of the Web, more and more users, including young, middle-aged, and elderly people from a variety of occupations and fields, are surfing the Internet. Recently, several social networks have emerged on the Web. It is of great practical use to analyze the structure and evolving trend of WSNs, since this can help website managers classify their users by their interactions and provide special services for distinct groups of people.

PageRank, which is used to compute the importance of pages, has attracted a lot of attention since it was first proposed by Lawrence Page [5]. It is based on the idea that pages linked to by pages with high authority are often treated as high-quality pages. However, PageRank can suffer from the "rank sink" phenomenon, which affects the correctness of scoring: a group of pages are connected with each other but have no out-links to other pages. Once there are in-links from outside pages to pages in this group, the PageRank score received through those links stays within this circle of pages and cannot be transferred to other pages.

In order to overcome Drawback (1) addressed in Section I, Zhang et al. [6] proposed a new PageRank algorithm to accelerate the scoring of web pages; otherwise, as web pages become outdated, the efficiency of page scoring falls quickly. In order to handle Drawback (3), Haveliwala [7] proposed a topic-sensitive PageRank algorithm which handles the situation where some pages get a high score in one field but may not be considered important in other fields. This approach contains three key phases: (1) for each page, compute rank values based on sixteen basic topic vectors from the Open Directory [8], (2) compute the similarity between the user's query and the given topic vectors, and finally (3) return the topic that is closest to the query. This method helps avoid the problem of theme drift. Matthew Richardson and Pedro Domingos found that [9] when a user surfs the Internet from one page to another, the user's behavior is greatly impacted by the content of the currently viewed page and by the queries. Consequently, they proposed a new PageRank algorithm based on URLs and the content of pages.

In recent years, social network analysis (SNA) has been recognized as a promising technology for studying complex networks. In the SNA methodology, there are three essential techniques: relational analysis, positional analysis, and hierarchical clustering [10]. Centrality measures are particularly appropriate for analyzing the importance of members and can be applied to WSNs for scoring pages as well. We previously proposed a centrality-measure-based approach that standardizes three centrality degrees and used it to identify the central players in criminal networks [11].

The Web is a specific social network in which each page can be treated as an actor. The study of web social network analysis can be categorized into three directions: (1) link structure analysis, (2) content mining, and (3) log mining. Web link structure analysis has recently attracted increasing attention. The Web has a particular link structure, i.e., hyperlinks, which are used to represent the relations among pages and can be classified into out-links and in-links. Based on this special structure, we can find useful knowledge in web pages, including the importance of each page and the topology of WSNs. There are three commonly used centrality measures for web social network analysis: degree centrality, betweenness centrality, and closeness centrality [12].

Prestige is a more refined measure of the importance or prominence of an actor than centrality [4]. The main difference between centrality and prestige is that centrality treats out-links as an important factor, while prestige focuses on the impact of in-links. We introduce three prestige measurements in the following [2], [4].

∙ Degree prestige: an actor is very prestigious if it obtains several in-links or nominations.
∙ Proximity prestige: the degree prestige of an actor only considers the actors adjacent to it; proximity prestige generalizes it by considering the actors both directly and indirectly linked to it.
∙ Rank prestige: rank prestige is based on the observation that one's prestige is affected by the ranks of the involved actors. It forms the basis of most web link analysis algorithms, including PageRank and HITS.

Information Retrieval (IR) [13], [14] is an essential technique that is widely used in web content mining, especially for Web search engines. An IR model defines how to represent queries and documents, and how to measure the similarity between a query and a document. Basically, there are three commonly used IR models: the boolean model, the vector space model, and the language model. In particular, the vector space model is the best known IR model [4].

III. PROBLEM STATEMENT AND PRELIMINARIES

In this section, we first give a new concept called the web social network, and then formally define and analyze the working mechanism of the PageRank algorithm.

Definition 1 (web social network): A web social network ℵ is defined as a graph ℵ = (V, E), where V is a set of web pages and E is a set of directed edges (p, q, w), where p, q ∈ V and w is a weight between p and q. The pages in ℵ must satisfy the condition that there is at least one in-link or out-link from one page to another, i.e., the nodes in ℵ are directly (or indirectly) connected to others.

Before introducing the improved PageRank algorithm, we give the formula by which the score of each page is computed:

    PageRank(p) = 1 − d + d × Σ_{all q link to p} PageRank(q) / N(q)    (1)

where PageRank(⋅) is the scoring function, p represents a web page, d is a damping factor between 0 and 1, which guarantees that the surfer can browse each page with a nonzero probability and is set to 0.85 (shown to be a reasonable value by Qin et al. in [15]), q is any page linking to p, and N(q) is the number of out-links from q. Intuitively, the rank value of a page is obtained by iteratively calculating the above formula.
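To make the fixed-point iteration behind Equation (1) concrete, here is a minimal Python sketch (our illustration, not code from the paper; the three-page link graph is hypothetical, and every linked page is assumed to appear as a key of the graph):

    # Minimal sketch of the PageRank iteration in Equation (1).
    def pagerank(out_links, d=0.85, iterations=50):
        """out_links maps each page to the list of pages it links to."""
        pages = list(out_links)
        score = {p: 1.0 for p in pages}          # initial rank values
        in_links = {p: [] for p in pages}        # invert the graph once
        for q, targets in out_links.items():
            for p in targets:
                in_links[p].append(q)
        for _ in range(iterations):              # iterate until (roughly) convergent
            score = {p: 1 - d + d * sum(score[q] / len(out_links[q])
                                        for q in in_links[p])
                     for p in pages}
        return score

    if __name__ == "__main__":
        web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
        print(pagerank(web))   # C collects votes from both A and B

Note that Equation (1) uses the non-normalized "1 − d" form rather than the (1 − d)/N variant, and the sketch follows the formula as written.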

By combining the basic idea of PageRank, we make the following statements.

1) A page is definitely important if a lot of pages link to it or some prestigious pages connect with it.
2) The behavior of browsing web pages is a stochastic process [4]. Generally, surfers browse a web page with a particular probability based on their preferences.

IV. A NEW PAGE RANK ALGORITHM BASED ON SIMILARITY MEASURE

Normally, traditional PageRank algorithms do not take into consideration the impact of the content (text) of each page, and only employ the link relations among pages to compute the rank of each page. Actually, the accuracy of page scoring greatly depends on the content, which is the target information the surfers are interested in and looking for. This requirement motivates us to borrow the similarity measure from the vector space model and use it to score pages: a page in the vector space model is represented as a weight vector, in which each component weight is computed based on some variation of the TF (Term Frequency) or TF-IDF (Term Frequency-Inverse Document Frequency) scheme, as follows [4].

Definition 2 (TF scheme): In the TF scheme, the weight of a term t_i in page d_j is the number of times that t_i appears in document d_j, denoted by f_ij. The following normalization approach is applied [4]:

    tf_ij = f_ij / max{f_1j, f_2j, ..., f_|V|j}    (2)

where f_ij is the frequency count of term t_i in page d_j, and |V| is the vocabulary size of the collection. If term t_i does not appear in d_j, then tf_ij = 0.

The disadvantage of the TF scheme is that it does not consider the case where a term appears in several pages, which limits its application.

Definition 3 (TF-IDF scheme): Let N be the total number of pages in a web database, df_i be the number of pages in which term t_i appears at least once, and f_ij be the frequency count of term t_i in page d_j. The inverse document frequency (denoted by idf_i) of term t_i is computed by [4]:

    idf_i = log(N / df_i)    (3)

The term weight is computed by:

    w_ij = tf_ij × idf_i    (4)

Note that the TF-IDF scheme is based on the intuition that if a term appears in several pages, it is probably not important or not discriminative [4]. In this study, we improve the following formula, proposed by Salton and Buckley [16], for computing the TF-IDF weight of each term, and employ the improved method to determine how important a term is in a page:

    w_ij = (0.5 + 0.5 × f_ij / max{f_1j, f_2j, ..., f_|V|j}) × log(N / df_i)    (5)

From Equation (5), we can see that if a term t_i appears in every document, then N = df_i and w_ij = 0, which means that t_i carries no weight in any page of a WSN. However, in real-world situations, even if a term appears in every page, the topics of the pages can still differ due to the impact of other important or prestigious terms. So we improve the above formula to:

    w_ij = (0.5 + 0.5 × f_ij / max{f_1j, f_2j, ..., f_|V|j}) × log((N + 1) / df_i)    (6)

The logarithmic factor of Equation (6), log((N + 1)/df_i), guarantees that w_ij ≠ 0 even if t_i appears in every document.
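To illustrate the weighting of Equation (6), here is a minimal Python sketch (ours, not the paper's implementation; the helper names are illustrative). It uses base-10 logarithms, which matches the worked example below; note that the example normalizes term frequency by page length, whereas this sketch follows the max-frequency normalization as Equation (6) is written:

    import math

    def term_weight(term, page_terms, num_pages, doc_freq):
        """Smoothed TF-IDF weight of Equation (6).
        page_terms: list of tokens in the page; num_pages: N;
        doc_freq[t]: number of pages containing term t (df_t)."""
        freq = {t: page_terms.count(t) for t in set(page_terms)}
        max_f = max(freq.values())                # max{f_1j, ..., f_|V|j}
        f_ij = freq.get(term, 0)
        tf = 0.5 + 0.5 * f_ij / max_f             # smoothed term frequency
        # shifted IDF: nonzero even when the term appears in every page (df = N)
        idf = math.log10((num_pages + 1) / doc_freq[term])
        return tf * idf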
Generally, we have to judge whether two pages are similar and how similar they are, which helps to effectively rank web pages in a WSN, since the actors in a WSN often care about similar topics.

For a query Q = (t_1, t_2, ..., t_m), a page vector p_j can be denoted as (w_1j, w_2j, ..., w_mj) to show its relevance to this query, where m is the number of terms in query Q. We propose the following similarity measure to compute the similarity between pages p_a and p_b:

    sim(p_a, p_b) = (d_a ⋅ d_b) / (||d_a||² + ||d_b||² − d_a ⋅ d_b)
                  = (Σ_{i=1}^{m} w_ipa × w_ipb) / (Σ_{i=1}^{m} w²_ipa + Σ_{i=1}^{m} w²_ipb − Σ_{i=1}^{m} w_ipa × w_ipb)    (7)

To facilitate understanding of Equation (7), we give the following example. Suppose there are two pages whose contents are as follows:

p1 = After the main earthquake in May, another violent earthquake happens again.
p2 = Natural disasters include earthquake, volcanic eruptions, landslides, tsunamis, floods, and drought.

And the query Q = {t_1} = {earthquake}. Based on Equations (2)-(7), we can obtain:

    tf_p1 = 2/10 = 0.2,  tf_p2 = 1/11 = 0.091
    idf_p1 = log(3/2) = 0.176,  idf_p2 = log(3/2) = 0.176
    w_qp1 = (0.5 + 0.5 × 0.2) × 0.176 = 0.106
    w_qp2 = (0.5 + 0.5 × 0.091) × 0.176 = 0.096
    ∴ sim(p1, p2) = (0.106 × 0.096) / (0.106² + 0.096² − 0.106 × 0.096) ≈ 0.990
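The following short Python sketch (ours, for illustration) implements Equation (7) and checks the example above. With the two one-dimensional weight vectors it returns roughly 0.99, confirming that the two pages are rated highly similar with respect to the query term:

    def sim(da, db):
        """Similarity of Equation (7): dot / (|da|^2 + |db|^2 - dot)."""
        dot = sum(a * b for a, b in zip(da, db))
        norm2_a = sum(a * a for a in da)
        norm2_b = sum(b * b for b in db)
        return dot / (norm2_a + norm2_b - dot)

    # One query term, so each page vector has a single TF-IDF weight
    # (the values 0.106 and 0.096 come from the worked example).
    print(sim([0.106], [0.096]))   # ≈ 0.990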

Based on this similarity measure, we propose a new page rank algorithm, called SimRank. It contains two main phases: (1) apply the similarity measure within the k-means algorithm to partition the crawled web pages into distinct WSNs, and (2) use an improved PageRank algorithm in which two distinct weight values are assigned to the title and body of a page, respectively. The details are given in Algorithm 1.

Algorithm 1 SimRank: A Page Rank Approach Based on Similarity Measure
Input: A web database D = {p_1, p_2, ..., p_n} containing n pages, a query Q, and the number of clusters k.
Output: A list of page scores score_ij in each clustered WSN N_i.
1. for (i = 0; i < n; i++) do
2.   for (j = 0; j < n; j++) do
3.     sim[i][j] = sim(p_i, p_j);
4. Arbitrarily choose k pages as the initial cluster centers;
5. repeat
6.   for each page p ∈ D do
7.     Assign p to the closest center based on the mean value of the objects in the cluster;
8.   Recalculate the centroid of each cluster from the given similarity between pages;
9. until no change;
10. for each page p_j ∈ D do
11.   w_ij = 0.7 × w_ij^title + 0.3 × w_ij^body;
12.   SimRank(p_j) = w_ij;
13. repeat
14.   for each cluster N_l ∈ D do
15.     for each page p ∈ N_l do
16.       SimRank(p) = 1 − d + d × Σ_k w_ik × SimRank(k) / N(k);
17. until the SimRank algorithm converges;
18. for each WSN N_i ∈ D do
19.   for each page p_j ∈ N_i do
20.     Output score_ij;

The SimRank algorithm contains the following main phases: (1) use Equation (7) to compute the similarity between pages (lines 1-3); (2) treat the similarity as the distance between pages and apply it in the k-means algorithm in order to partition the web database D into distinct WSNs (lines 4-9); (3) use line 11 to compute the relevance to the given query based on Equation (4), and assign the probability (i.e., w_ij) of browsing a page as the initial page rank value of each page (line 12). Note that the main content of a crawled page consists of two parts, title and body. Through experimental studies, we found that these two parts have different impacts on the query result; that is, a surfer often decides to browse an interesting page by first taking a look at its title. So we assign a bigger weight (i.e., 0.7) to the title than the weight of 0.3 assigned to the body; these two weight values achieve good results in our experiments; (4) iteratively compute the score of each page by the weighted page rank formula given in line 16 for each clustered WSN (lines 13-17), where k ranges over the pages that link to p; and finally (5) output the score of each page in the WSNs (lines 18-20). A condensed sketch of this flow is given below.
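For concreteness, here is a condensed Python sketch of Algorithm 1 (our illustration, not the authors' code). It assumes the pairwise similarities and the per-page query weights (title and body, per Equation (6)) are already computed. Two points are our own reading: the centroid update of line 8 is replaced by a medoid update, since the paper leaves the recomputation from similarities informal, and the page similarity is used as the link weight w_ik in line 16:

    import random

    def simrank(pages, sim, w_title, w_body, in_links, n_out, k, d=0.85, rounds=30):
        """Condensed sketch of Algorithm 1.
        sim[a][b]: similarity per Eq. (7), defined for a == b as well;
        w_title/w_body: per-page query weights; in_links[p]: pages linking
        to p; n_out[q]: number of out-links of q."""
        # Lines 4-9: k-means-style clustering, similarity playing the role of closeness.
        centers = random.sample(pages, k)
        for _ in range(rounds):
            clusters = {c: [] for c in centers}
            for p in pages:
                clusters[max(centers, key=lambda c: sim[p][c])].append(p)
            # Medoid update: the member most similar to the rest of its cluster
            # (our stand-in for line 8).
            new = [max(ms, key=lambda p: sum(sim[p][q] for q in ms))
                   for ms in clusters.values() if ms]
            if set(new) == set(centers):
                break
            centers = new
        # Lines 10-12: initial score = weighted query relevance of title and body.
        score = {p: 0.7 * w_title[p] + 0.3 * w_body[p] for p in pages}
        # Lines 13-17: weighted PageRank iteration inside each clustered WSN,
        # using sim[p][q] as the link weight w_ik (one plausible reading of line 16).
        for members in clusters.values():
            wsn = set(members)
            for _ in range(rounds):
                score.update({p: 1 - d + d * sum(sim[p][q] * score[q] / n_out[q]
                                                 for q in in_links[p] if q in wsn)
                              for p in wsn})
        return clusters, score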

Algorithm 1 has the following advantages: (1) it uses the k-means clustering approach to divide a web database into several WSNs and to clean up unrelated pages, which helps reduce the cost of computation; (2) the proposed similarity measure can effectively cluster and score the pages; and (3) the proposed weighted page rank algorithm has been evaluated to be an effective and feasible approach.

V. WEB CRAWLER INTRODUCTION

In order to construct a web social network, we have to crawl pages from the Web. Developing an efficient and effective web crawler is a challenging problem for the following reasons: (1) there is a large number of useless URLs (e.g., advertising links) on the Web, and we have to clear out these unnecessary pages, which improves the efficiency of crawling and saves storage space; (2) a crawler should be capable of storing the relations among distinct pages, i.e., the "out-link" and "in-link" relations; (3) the crawling phase should terminate when it reaches pages that are deep enough, e.g., by specifying a proper crawling depth or a time constraint.

In this study, we develop an efficient and powerful web crawler to obtain data from the Web and store useful and valuable pages in a web database. Each data item contains the following attributes: the page number, URL, title, body, and the list of pages linked to this page. Figure 1 shows the interface of our proposed web crawler.

Fig. 1. Visualization of the web crawler

The user only needs to specify the seed URL in the text area following the label "start url", and the crawler provides the following functions: (1) show the currently fetched URL in the text area following the label "present url", (2) show the numbers of crawled and remaining pages in the text areas following the labels "snatched" and "remaining", respectively, and (3) display the currently crawled page at the bottom of the interface, e.g., the homepage of Southwest Jiaotong University.

There are some principles for the web crawler when downloading pages, listed below; a minimal sketch of the crawling loop follows the list.

∙ If a page has never occurred before, a new record is created for it in the database;
∙ If a page already exists in the database, its father pages (the pages that link to it) are added; they are stored in an array structure;
∙ Repeated father pages are omitted.
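A minimal sketch of such a crawling loop in Python (our illustration; the paper's crawler is a VS.net application, and the helper names here are hypothetical). It honors the three principles above: new pages get fresh records, known pages only accumulate father links, and duplicates are skipped; a depth limit bounds the crawl. Title/body extraction and ad-link filtering are omitted for brevity:

    import re
    import urllib.request
    from collections import deque

    def crawl(seed, max_depth=2):
        """records[url] = {"fathers": [...]}; fathers are pages linking to url."""
        records = {seed: {"fathers": []}}
        queue = deque([(seed, 0)])
        while queue:
            url, depth = queue.popleft()
            if depth >= max_depth:                 # principle (3): stop when deep enough
                continue
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except OSError:
                continue
            for link in re.findall(r'href="(http[^"]+)"', html):
                if link not in records:            # new page: create a new record
                    records[link] = {"fathers": [url]}
                    queue.append((link, depth + 1))
                elif url not in records[link]["fathers"]:
                    records[link]["fathers"].append(url)   # known page: add father, skip repeats
        return records

    # records = crawl("https://www.swjtu.edu.cn/")  # e.g., the homepage shown in Fig. 1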
VI. EXPERIMENTS

In this section, we perform experimental studies comparing SimRank with our previously proposed centrality-measure-based PageRank algorithm (called WebRank) and the traditional PageRank algorithm. Basically, the WebRank approach integrates three centrality measures, i.e., degree, betweenness, and closeness, into the PageRank algorithm in order to score pages; the details are available in [17]. All algorithms are implemented on the VS.net development platform, and the experiments are conducted on an Intel T2300 1.66GHz CPU with 1.5GB of main memory, running the Windows XP operating system. All experiments were run on real data crawled from the Internet by our proposed web crawler.

A. Clustering Estimation of SimRank

In this section, we evaluate the effectiveness and efficiency of the clustering approach based on the similarity measure derived from the vector space model, which is applied in SimRank to partition a web database into several WSNs. To facilitate exposition, we call this clustering approach SimCluster and compare it with the k-means algorithm.

1) Efficiency Analysis: In this set of experiments, we compare the execution time of SimCluster with the k-means algorithm as the number of pages grows from 10 to 1,000. The results are given in Figure 2.

Fig. 2. Execution time comparison between SimCluster and k-means (x-axis: number of web pages, from 10 to 1,000; y-axis: execution time in seconds)

From Figure 2, we can see that the execution time of SimCluster is much less than that of the k-means algorithm in all cases; i.e., SimCluster reduces the runtime by a factor of about 16 compared to k-means on average. This is because, for the k-means algorithm, the distance between pages is calculated by the shortest path algorithm (i.e., the Floyd algorithm), which works by many iterative calculations and is time-intensive. SimCluster, in contrast, focuses on computing the similarity between pages, and the time of this phase grows in a nearly linear manner, which can be observed in Figure 2.

2) Accuracy Comparison: In order to further evaluate the performance of SimCluster, we compare the clustering accuracy of SimCluster and k-means. Table I shows the clustering accuracy of these two approaches as the number of pages increases from 10 to 1,000.

TABLE I
CLUSTERING ACCURACY COMPARISON BETWEEN SIMCLUSTER AND K-MEANS

No. of pages   SimCluster   k-means
10             70.1%        66.5%
20             72.6%        70.5%
30             75.1%        73.8%
50             78.2%        78.0%
100            82.0%        80.0%
200            83.0%        81.5%
300            85.0%        81.3%
400            85.5%        83.8%
500            87.0%        81.8%
1000           85.5%        82.4%

In Table I, the first column is the number of pages, and the second and third columns show the clustering accuracy of SimCluster and k-means, respectively, under distinct numbers of pages. As we can see from the table, SimCluster outperforms k-means in clustering pages in all cases, with an average gap of 3.15%. The reason is that SimCluster uses the similarity between pages to group them, which closely reflects the topics or themes that surfers are interested in. Because the grouping is based on the content of pages, the query results can be more easily accepted by surfers.

B. Accuracy and Efficiency Comparison of Page Scoring

In this section, we compare the scoring accuracy and execution time of SimRank with our previously proposed WebRank algorithm and the traditional PageRank algorithm, respectively. Firstly, we observe the scoring accuracy as the number of pages grows from 10 to 1,000; the results are given in Table II.

TABLE II
ACCURACY OF PAGE SCORING AMONG THREE ALGORITHMS

No. of pages   PageRank   WebRank   SimRank
10             62%        85%       78%
20             65%        84%       80%
30             70%        86%       81%
50             71%        86%       83%
100            72%        87%       85%
200            74%        88%       87%
300            75%        86%       88%
400            78%        87%       89%
500            78%        83%       89%
1000           76%        81%       90%

According to Table II, we can see that: (1) the scoring accuracy of SimRank increases gradually, while the accuracy of WebRank grows when the number of pages is small (fewer than 200 pages) but decreases once the number of pages grows beyond 200. This is because the prediction accuracy of WebRank falls when the WSN becomes very complex, whereas SimRank is unaffected even when the WSN becomes intricate: SimRank uses the similarity measure to group pages and scores web pages within a clustered WSN, which cleans up useless pages and accurately captures the relations among pages; (2) the scoring accuracy of PageRank increases with the number of pages. This is because PageRank adopts the authority transition mechanism to score pages: if the number of authoritative pages increases, the authority transfers to other pages and improves the scoring accuracy; (3) SimRank outperforms PageRank by an average gap of 14.1% and has ranking accuracy comparable to WebRank. The reason is that the PageRank algorithm focuses on computing the importance of pages in the whole Web, instead of within a WSN.

We further compare the execution time of SimRank with WebRank; the experimental results are given in Figure 3, where the x-axis is the number of pages and the y-axis represents the execution time of each algorithm.

Fig. 3. Execution time comparison between SimRank and WebRank (x-axis: number of web pages, from 10 to 1,000; y-axis: execution time in seconds)

As we can see from Figure 3, SimRank is about 1.08 times faster than WebRank on average. This is because WebRank needs to compute the shortest paths between pages and uses them to calculate the centrality measures (i.e., betweenness and closeness), which is time-intensive, while SimRank only needs to compute the similarity between different pages.

VII. CONCLUSIONS AND ONGOING WORK

With the rapid development of search engine techniques, surfers have become more demanding about query results than before. In order to rank massive numbers of web pages accurately and efficiently, we propose a new page rank algorithm based on a similarity measure from the vector space model, called SimRank, to score web pages. We first propose a new similarity measure to compute the similarity between pages and use it to partition a web database into several WSNs. We then improve the traditional PageRank algorithm by assigning the probability of browsing a page as the initial page rank value of each page. In addition, we design a web crawler to efficiently download web data. Finally, we conduct experiments to evaluate the time efficiency and effectiveness of the proposed algorithms against existing approaches.

Future work will focus on the following research directions: (1) improving the performance of the SimRank algorithm, that is, reducing the time cost of scoring pages based on PageRank, (2) extending the SimRank algorithm to find the top (or key) actors in a WSN, and (3) designing a web social network analysis system by integrating other web mining algorithms.

REFERENCES

[1] Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107–117, 1998.
[3] Decai Huang and Huachun Qi. PageRank algorithm research. Computer Engineering, 32(4):145–162, 2006.
[4] Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[5] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Computer Science Department, Stanford University, 1999.
[6] Ling Zhang and Fanyuan Ma. Accelerated ranking: a new method to improve web structure mining quality. Journal of Computer Research and Development, 41(1):98–103, 2004.
[7] Taher H. Haveliwala. Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4):784–796, 2003.
[8] The Open Directory Project: web directory for over 2.5 million URLs. http://www.dmoz.org/.
[9] Matthew Richardson and Pedro Domingos. The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. MIT Press, Cambridge, MA, 2002.
[10] Jennifer J. Xu and Hsinchun Chen. CrimeNet Explorer: a framework for criminal network knowledge discovery. ACM Transactions on Information Systems, 23(2):201–226, 2005.
[11] Shaojie Qiao, Changjie Tang, Jing Peng, Wei Liu, Fenlian Wen, and Jiangtao Qiu. Mining key members of crime networks based on personality trait simulation email analyzing system. Chinese Journal of Computers, 31(10):1795–1803, 2008.
[12] L. Freeman. Centrality in social networks: conceptual clarification. Social Networks, 1:215–239, 1979.
[13] D. A. Grossman and O. Frieder. Information Retrieval: Algorithms and Heuristics. Springer, Secaucus, NJ, USA, 2004.
[14] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, 1983.
[15] Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman, and Hsinchun Chen. Analyzing terrorist networks: a case study of the global Salafi Jihad network. In ISI 2005: IEEE International Conference on Intelligence and Security Informatics, pages 287–304. IEEE, Atlanta, Georgia, 2005.
[16] Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Technical report, Cornell University, Ithaca, NY, USA, 1987.
[17] Shaojie Qiao, Jing Peng, Hong Li, Tianrui Li, Liangxu Liu, and Hongjun Li. WebRank: a hybrid page scoring approach based on social network analysis. In RSKT 2010: The Fifth International Conference on Rough Set and Knowledge Technology. Springer, Beijing, China, 2010.
