PAPER
1. Introduction
The author is with the Department of Computer Engineering, Kasetsart University, Thailand.
The author is with the Graduate School of Fundamental Science and Engineering, Waseda University, Japan.
a) E-mail: un@mikelab.net
be more authoritative. A web page for an upcoming conference is clearly more interesting than a page
for a past conference; a link to the current version of
an open-source software package is probably more useful than
a link to an old version. Indeed, one can argue that
search engines are most needed for finding newer, more
dynamic content; such items are less likely to be findable via web directories (e.g., ODP [8]), Wikipedia
[23], or other resources that a human user can turn to
for help.
In this paper, we propose a method to incorporate temporal aspects into the authority ranking of web
pages using a modified version of the PageRank algorithm. Three temporal factors are incorporated: age,
event, and trend. These three capture the intuitive notion that a page with recent updates, an update at a special time, or a high frequency of updates is potentially
more interesting to users. These attributes are computed from page timestamps (directly or by inference)
and a modification history that is stored during each
web crawl; the event and trend factors require categorizing pages and comparing the temporal history of
pages in the same category, as described in Sect. 3. We
discuss other algorithms that use temporal information
in Sect. 2, notably T-Rank [4] and TimedPR [24], which
are included in our experimental study.
The remainder of this paper is organized as follows. In Sect. 2, we briefly review the PageRank algorithm and some related work. In Sect. 3, we first
introduce the temporal factors, and then describe how to apply them to our new variant of
PageRank. In Sect. 4, we report and discuss the results
of experiments. We offer conclusions and future work
in Sect. 5.
2.
\[ \ldots \sum_{u \in B(v)} \frac{r^{(k)}(u)}{out(u)}, \]
(2)
(3)
(4)
counteract bias against new pages. Yu et al. [24] propose another temporal variant of PageRank in the context of scientific publications, called TimedPageRank
(TimedPR). However, their technique cannot
be directly applied to the Web. Although the citation
concept of scientific papers is quite similar to hyperlinking, some other
aspects differ from general web content. For instance, scientific papers carry static information fixed at
publication time. Since they cannot be deleted, their
citation counts are monotonically increasing. In contrast, both the content and the links of web pages can be modified and deleted
over time. Scientific papers can only
cite others further in the past, so there are
no citation loops. On the Web, target
pages can be subsequently modified, becoming newer
than their referrers and providing higher-quality information. This factor consequently affects authoritativeness and should be considered.
Berberich et al. [4] specify a temporal window of
interest to analyze the freshness and update frequency of
both pages and links. They apply these two factors in
the PageRank computation, called Time-Aware Authority
Ranking (T-Rank). However, their technique
is somewhat impractical because the modification time
of links is hardly detectable; it is usually approximated using the modification time of pages. In addition, the
results are not representative since they were reported only
for a dataset on a specific domain. In fact, each domain usually has its own change behavior. For example, news pages that are updated daily should not dominate the others. Therefore, in dealing with the Web,
the topics of web pages should also be considered.
3.
3.1 Temporal Factors
[Fig. 1: AF plotted against Time, from TS_Start to TS_End.]
AF(u) = ... TS_End - TS_Start ... if TS_Start < ts_Last(u) <= TS_End; 0 otherwise. (5)
(6)
3.1.2 Event Factor
... if u \in C_i and ts_Last(u) \in I_j. (7)
\[ x'_i(u) = \sum_{j:\, t_j \in I_i} x_j(u). \tag{8} \]
Similarly, we let $x'(C_j)$ be the r-dimensional vector representing the update profile of a category, where
$C_j \in C$ and
\[ x'_i(C_j) = \frac{\sum_{u \in C_j} x'_i(u)}{|C_j|}. \tag{9} \]
That is, $x'(C_j)$ refers to the average profile of all pages
contained in the same category. We finally define the
trend factor of a page u, denoted by TF(u), to be:
\[ TF(u) = Sim(x'(u), x'(C_j)); \quad u \in C_j. \tag{10} \]
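The update profiles and trend factor of Eqs. (8)-(10) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the surviving text: `Sim` is taken to be cosine similarity, timestamps are plain numbers, `update_times` lists the pre-defined timestamps at which a page was observed to change, and all function names are hypothetical.

```python
import math
from bisect import bisect_left
from typing import List, Sequence


def interval_profile(update_times: Sequence[float],
                     boundaries: Sequence[float]) -> List[int]:
    """Eq. (8): count updates per interval I_i = (TS_{i-1}, TS_i].

    boundaries = [TS_0, TS_1, ..., TS_r], sorted ascending.
    """
    profile = [0] * (len(boundaries) - 1)
    for t in update_times:
        if boundaries[0] < t <= boundaries[-1]:
            # Smallest i with boundaries[i] >= t, so t lies in (TS_{i-1}, TS_i].
            i = bisect_left(boundaries, t)
            profile[i - 1] += 1
    return profile


def category_profile(profiles: Sequence[Sequence[float]]) -> List[float]:
    """Eq. (9): component-wise average over all pages of one category."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; an ASSUMED choice for Sim in Eq. (10)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def trend_factor(page: Sequence[float], category: Sequence[float]) -> float:
    """Eq. (10): similarity between a page profile and its category profile."""
    return cosine(page, category)


# Toy category with two pages and three intervals (0,10], (10,20], (20,30].
bounds = [0, 10, 20, 30]
p1 = interval_profile([5, 15, 25], bounds)   # one update in every interval
p2 = interval_profile([5, 8], bounds)        # two updates, both early
avg = category_profile([p1, p2])
tf1, tf2 = trend_factor(p1, avg), trend_factor(p2, avg)
```

A page whose update pattern is closer to the category average receives the larger trend factor, independent of its raw update count.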
3.1.3 Trend Factor
The age and event factors only consider the last modification time of web pages. We believe that how often a
page is updated can also be exploited to estimate its
importance. A frequently updated page shows that it is actively maintained. However, changes usually depend
on the kind of topic as well. Considering only the frequency of change may be biased and unfair to
pages that already present correct or
reliable information. For instance, a news page will
likely be updated more frequently than a scientific one.
In our approach, a trend factor is defined to describe how similar the change behavior of a web page is
to that of others within the same topic, based on the assumption that most important pages will behave in the same
way. We first introduce an update profile of each page
based on the changes over the observation period. Let
$T = \{t_i \in (TS_{Start}, TS_{End}] \mid 1 \le i \le m \text{ and } t_1 < t_2 < \ldots < t_m\}$
be a set of pre-defined timestamps. Let $x(u)$ be the
m-dimensional vector representing the update profile
of page u:
\[ x_i(u) = \begin{cases} 1 & \text{if } u \text{ was updated at timestamp } t_i, \\ 0 & \text{otherwise.} \end{cases} \]
However, using this profile to directly compare
pages will produce large differences, since it is unlikely
that web pages are updated at exactly the same
timestamps throughout the observation period. We
thus let $I = \{I_1, I_2, \ldots, I_r\}$ be a set of time intervals
such that $I_i = (TS_{i-1}, TS_i]$ for $1 \le i \le r$, where $TS_i$
is a pre-defined timestamp and $TS_{Start} = TS_0 < TS_1 < \ldots < TS_{r-1} < TS_r = TS_{End}$.
We define $x'(u)$, an r-dimensional vector, to represent a new update profile
of u, whose i-th component sums $x_j(u)$ over the timestamps $t_j$ that fall in $I_i$, as given in Eq. (8).
3.2 Time-Weighted Ranking
\[ \ldots \sum_{u \in (B(v) \cup D)} \ldots + (1 - \ldots)\, s(v). \tag{11} \]
3.2.1 Time-Weighted Transition
\[ t(u, v) = \ldots \tag{12} \]
with the three coefficients (subscripted 1, 2, and 3) each in $[0, 1]$. Equation (12)
represents the probability that a random
surfer selects a page for his next hop under some biases, i.e.,
the inverse-age, event, and trend values. In other
words, the surfer will choose the target page v by its up-to-dateness, event importance, and activeness, in
comparison to all targets.
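The idea of a biased next-hop choice can be illustrated with a short Python sketch. The body of Eq. (12) is not reproduced here, so the specific form below is an assumption and not the paper's formula: each target is scored by a weighted sum of its inverse-age, event, and trend values, and the scores are normalized over all targets of the source page. The function and parameter names (`time_weighted_transition`, `c_age`, `c_event`, `c_trend`) are hypothetical.

```python
from typing import Dict, List


def time_weighted_transition(
    out_links: Dict[str, List[str]],
    iaf: Dict[str, float],
    ef: Dict[str, float],
    tf: Dict[str, float],
    c_age: float = 1 / 3,
    c_event: float = 1 / 3,
    c_trend: float = 1 / 3,
) -> Dict[str, Dict[str, float]]:
    """Return t[u][v], the probability of hopping from u to v.

    ASSUMPTION: a target's bias is a linear combination of its three
    temporal factors, compared against all targets of the same source.
    """
    def bias(v: str) -> float:
        return c_age * iaf[v] + c_event * ef[v] + c_trend * tf[v]

    transition: Dict[str, Dict[str, float]] = {}
    for u, targets in out_links.items():
        weights = {v: bias(v) for v in targets}
        total = sum(weights.values())
        if total > 0:
            transition[u] = {v: w / total for v, w in weights.items()}
        elif targets:
            # All biases are zero: fall back to a uniform choice.
            transition[u] = {v: 1 / len(weights) for v in weights}
        else:
            # Page without out-links: no transition row.
            transition[u] = {}
    return transition
```

With the age coefficient set to 1 and the others to 0, the surfer follows only the inverse-age values of the targets; this mirrors the single-factor settings examined in the sensitivity analysis.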
Table 1
3.2.2 Age-Biased Vector
\[ s(v) = \frac{IAF(v)}{\sum_{w \in V} IAF(w)}. \tag{13} \]
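As an illustration, the normalization in Eq. (13) and its use as a jump vector can be sketched in Python. Only the normalization follows directly from the text; the iteration is a generic biased PageRank under stated assumptions: the damping value 0.85, the handling of pages without out-links (their score is redistributed according to s), and the names `age_biased_vector` and `biased_pagerank` are all hypothetical and may differ from Eq. (11).

```python
from typing import Dict


def age_biased_vector(iaf: Dict[str, float]) -> Dict[str, float]:
    """Eq. (13): normalize inverse-age values into a probability vector."""
    total = sum(iaf.values())
    if total == 0:
        # ASSUMPTION: fall back to a uniform vector when every IAF is zero.
        return {v: 1 / len(iaf) for v in iaf}
    return {v: x / total for v, x in iaf.items()}


def biased_pagerank(
    transition: Dict[str, Dict[str, float]],
    s: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power iteration with jump vector s.

    transition[u][v] is the probability of moving from u to v; every page
    must appear as a key of s. Pages with an empty row are treated as
    dangling (ASSUMED handling).
    """
    rank = dict(s)
    for _ in range(iterations):
        dangling = sum(rank[u] for u in s if not transition.get(u))
        nxt = {v: (1 - damping) * s[v] + damping * dangling * s[v] for v in s}
        for u, row in transition.items():
            for v, p in row.items():
                nxt[v] += damping * rank[u] * p
        rank = nxt
    return rank


s = age_biased_vector({"a": 3.0, "b": 1.0})
ranks = biased_pagerank({"a": {"b": 1.0}, "b": {}}, s)
```

Because both the jump term and the dangling mass are distributed by s, the scores remain a probability distribution after every iteration.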
4.1 Dataset
affirmative action, amusement park, architecture, astronomy, blues, business, christmas, cruise, disaster, earthquake, fashion, film festival, game, graphic design, hurricane, innovation, lyme disease, marketing, music, olympic, parallel computing, presidential election, risk management, rock climbing, science conference, soccer, socialism, sushi, tattoo, tournament
4.2 Evaluation Measures
\[ OSim(R_1, R_2) = \frac{|R_1 \cap R_2|}{k}, \tag{14} \]
4.3 Experimental Results
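The overlap measure of Eq. (14) is easy to compute in practice. The Python sketch below assumes that R_1 and R_2 are the sets of the top-k pages of two rankings, and that each ranking is given as a list of page identifiers ordered from best to worst; the function name `osim` is hypothetical.

```python
from typing import Sequence


def osim(ranking_a: Sequence[str], ranking_b: Sequence[str], k: int) -> float:
    """Fraction of pages shared by the top-k of two rankings, Eq. (14)."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


# Two rankings that share two of their top-3 pages.
score = osim(["p1", "p2", "p3", "p4"], ["p2", "p3", "p5", "p1"], k=3)
```

The measure ignores the order of pages inside the top-k, which is why an order-sensitive measure such as KSim is reported alongside it.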
[Figures: OSim and KSim (vertical axes, 0.5 to 0.9) of PR, TWPR, T-Rank, and TimedPR over dates from 31-08-2008 to 31-01-2009.]
Table 2 Top-5 authority web pages produced by PR, TWPR,
T-Rank, and TimedPR, respectively.
1. en Europe)
   http://europa.eu.int/ploteus/portal
2. CM Magazine: On Strike: The Winnipeg General Strike, 1919.
   http://umanitoba.ca/outreach/cm/vol7/no4/onstrike.html
3. DRIIE | Department of international relations and european integration |
   http://www.die.ro
4. Japanese Public Holidays for 2004
   http://www3.sympatico.ca/ccsr/j2004.html
5. Home Planet Release 3.3a
   http://www.fourmilab.ch/homeplanet/homeplanet.html

1. http://www.filmweb.no
2. Census Bureau Home Page
   http://www.census.gov
3. ShopMania - Price comparison in US, Read reviews
   http://www.shopmania.com
4. National Institute of Neurological Disorders and Stroke (NINDS)
   http://www3.sympatico.ca/ccsr/j2004.html
5. Statistici, clasament si trafic web romanesc
   http://www.trafic.ro

1. http://www.filmweb.no
2. Stummfilm - ARTE
   http://www.arte.tv/de/film/stummfilm-auf-arte/690880.html
3. Census Bureau Home Page
   http://www.census.gov
4. ShopMania - Price comparison in US, Read reviews
   http://www.shopmania.com
5. Al-Ahram Weekly | Front Page
   http://weekly.ahram.org.eg

1. http://www.filmweb.no
2. ShopMania - Price comparison in US, Read reviews
   http://www.shopmania.com
3. HIT100.ro
   http://www.hit100.ro
4. Al-Ahram Weekly | Front Page
   http://weekly.ahram.org.eg
5. Regina Public Schools
   http://www.rbe.sk.ca
[Figure: OSim and KSim similarity (vertical axis, 0 to 0.8) at Top-5, Top-10, and Top-20 for PR, TWPR, T-Rank, and TimedPR.]
Table 3 Average in-degrees per web page for top-5, 10, and
20 of results produced by the ranking methods and the experts.
           Top-5   Top-10   Top-20
PR         5343    3717     2272
TWPR       3256    2597     1884
T-Rank     2556    2192     2021
TimedPR    2675    2024     1909
Experts    3852    3208     2004
4.4 Sensitivity Analysis
        C@1            C@2            C@3
        OSim   KSim    OSim   KSim    OSim   KSim
I@1     0.55   0.52    0.58   0.53    0.62   0.59
I@3     0.55   0.54    0.59   0.57    0.66   0.62
I@5     0.56   0.55    0.64   0.61    0.68   0.66
I@7     0.56   0.54    0.63   0.61    0.68   0.64
Table 5 Average similarities at top-10 between the experts'
rankings and the rankings produced by TWPR with settings C@3 and I@5,
varying the values of the three coefficients.
Notation    Coefficient 1   Coefficient 2   Coefficient 3   OSim   KSim
TWPR_A      1               0               0               0.59   0.54
TWPR_E      0               1               0               0.52   0.44
TWPR_T      0               0               1               0.55   0.48
TWPR_AE     1/2             1/2             0               0.63   0.58
TWPR_AT     1/2             0               1/2             0.66   0.63
TWPR_ET     0               1/2             1/2             0.61   0.55
TWPR_AET    1/3             1/3             1/3             0.68   0.66
Finally, we examine the effect of the remaining internal parameters. For this, we fixed the external parameter values at C@3 and I@5. The seven combinations of the three coefficients are
shown in Table 5. From the results, we conclude that
the inverse-age factor is the most important, while the
trend factor has the second largest impact and the event factor the smallest. This means that the experts prefer
newer, up-to-date web pages to those
that were last updated further in the past. However, the combinations with more than one factor produce better
rankings; in particular, TWPR_AET provides the best result.
5. Conclusion
For many web queries, human users seek current or actively maintained information; hence, one would expect
that time-related factors of a web page can contribute
to establishing its usefulness to human web searchers. Our
experiments show that this is indeed the case. Age,
event-related changes, and trend in revisions all appear
to contribute to improved page rankings, as compared
to the standard PageRank algorithm. The three temporal-aware algorithms outperformed PageRank; the TWPR
algorithm, which incorporates all three factors, showed
the greatest improvement.
There are several issues that need to be addressed
in future work. First, we employed a small subset of
web pages in our experiments that had been categorized
by ODP. To deal with the complete Web, including unknown pages, an automatic system for Web categorization is needed. Second, a model for predicting change
in the Web is also essential to save crawling time as
well as network bandwidth. Last, we plan to investigate
the content information and metadata in web pages and
integrate these into the computation for better rankings.
Acknowledgements
This research is supported by the Thailand Research
Fund through the Royal Golden Jubilee Ph.D. Program
(Grant No. PHD/0122/2548). We thank the members
of Yamana Laboratory for their help in data preparation and many helpful suggestions. We also thank
James Brucker for his intensive polishing of the paper.
References
[1] L.A. Adamic and B.A. Huberman, "The web's hidden order," Communications of the ACM, vol.44, no.9, pp.55-59, 2001.
[2] R.A. Baeza-Yates and B.A. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, England, pp.27-30, 1998.
[3] R.A. Baeza-Yates, F. Saint-Jean, and C. Castillo, "Web structure, dynamics and page quality," Proc. 9th International Symposium on String Processing and Information Retrieval, pp.117-130, 2002.
[4] K. Berberich, M. Vazirgiannis, and G. Weikum, "Time-aware authority ranking," Internet Mathematics, vol.2, no.3, pp.301-332, 2006.
[5] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks and ISDN Systems, vol.30, no.1-7, pp.107-117, 1998.
[6] J. Cho and S. Roy, "Impact of search engines on page popularity," Proc. 13th International World Wide Web Conference, pp.20-29, 2004.
[7] J. Cho, S. Roy, and R.E. Adams, "Page quality: In search of an unbiased web ranking," Proc. ACM SIGMOD International Conference on Management of Data, pp.551-562, 2005.
[8] DMOZ, The Open Directory Project, http://www.dmoz.org/.
[9] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, "Rank aggregation methods for the web," Proc. 10th International World Wide Web Conference, pp.613-622, 2001.
[10] e-Society Project, http://www.yama.info.waseda.ac.jp/~yamana/e-society/index_eng.htm.
[11] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore and London, 1996.
[12] B. Goncalves, M.R. Meiss, J.J. Ramasco, A. Flammini, and F. Menczer, "Remembering what we like: Toward an agent-based model of web traffic," Proc. 2nd ACM International Conference on Web Search and Data Mining, 2009.
[13] G.R. Grimmett and D.R. Stirzaker, Probability and Random Processes, Oxford University Press, USA, 2001.
[14] T.H. Haveliwala, "Efficient computation of PageRank," Stanford Digital Libraries, 1999.
[15] T.H. Haveliwala, "Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search," IEEE Trans. Knowledge and Data Engineering, vol.15, no.4, pp.784-796, 2003.
[16] J.M. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM, vol.46, no.5, pp.604-632, 1999.
[17] Y. Liu, B. Gao, T. Liu, Y. Zhang, Z. Ma, S. He, and H. Li, "BrowseRank: Letting web users vote for page importance," Proc. 31st ACM SIGIR Conference on Research and Development in Information Retrieval, pp.451-458, 2008.
[18] Lucene, The Apache Software Foundation, http://lucene.apache.org/.
[19] M.R. Meiss, F. Menczer, S. Fortunato, A. Flammini, and A. Vespignani, "Ranking web sites with real user traffic," Proc. 1st ACM International Conference on Web Search and Data Mining, pp.65-76, 2008.
[20] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Stanford Digital Libraries, 1998.
[21] S.M. Ross, Introduction to Probability Models, Academic Press, 2002.
[22] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz, "Analysis of a very large web search engine query log," ACM SIGIR Forum, vol.33, no.1, pp.6-12, 1999.
Bundit Manaskasemsak received the B.Eng. and M.Eng. degrees in Computer Engineering from Kasetsart University, Thailand, in 2003 and 2005, respectively. He has held the Royal Golden Jubilee scholarship for the Ph.D. program since 2005. He is a Ph.D. candidate in Computer Engineering, Kasetsart University. His current research interests include web search, information retrieval, and parallel and distributed computing.
Arnon Rungsawang received the B.Eng. degree in Electrical Engineering from King Mongkut Institute of Technology, Thailand, in 1986; the DEA-IARFA from Université de Pierre et Marie Curie (Paris VI), France, in 1993; and the Ph.D. degree in Informatique et Reseaux from ENST-Paris, France, in 1997. Since 1998, he has been a lecturer in the Department of Computer Engineering, Kasetsart University. His current research interests include web search, information retrieval, parallel and distributed computing, and artificial intelligence.
Hayato Yamana received the B.S., M.S., and Dr.Eng. degrees in Computer Science from Waseda University, Japan, in 1987, 1989, and 1993, respectively. From 1993 to 2000, he was a researcher in the Electrotechnical Laboratory, AIST, MITI. From 2000 to 2005, he was an associate professor at Waseda University. Since 2005, he has been a professor at Waseda University. His current research interests include data mining, distributed computing, and information retrieval. He has been the president of the Information Grand Voyage Project Consortium since 2007. He is a member of IPSJ, IEICE, ACM, and IEEE.