User Access Pattern Enhanced Small Web Search

Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma, Hong-Jiang Zhang
Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA, USA 98052
Abstract
Current search engines generally utilize link analysis techniques to improve the ranking of returned
web-pages. However, the same techniques applied to a small Web, such as a website or an intranet
Web, cannot achieve the same level of performance because the link structure is different from the
global Web. In this paper we propose a novel method of generating implicit link structure based on
users’ access patterns, and then apply a modified PageRank algorithm to produce the ranking of
web-pages. Our experimental results indicate that the proposed method outperforms the keyword-based
method by 16%, explicit link-based PageRank by 20% and DirectHit by 14%.

1. INTRODUCTION

Global search engines such as Google and AltaVista are widely used by Web users to find their desired information directly on the global Web. Nevertheless, to get more specific and up-to-date information, users usually go directly to websites as starting points and search within those websites. Intranet search is another growing application area, which enables search over the documents of an organization. In both cases, search occurs in a closed sub-space of the Web, which is called small Web search in this paper.

Existing small Web search engines generally utilize the same search and ranking technologies as those widely used on global search engines. However, their performance is problematic. As reported in the Forrester survey [3], current site-specific search engines fail to deliver all the relevant content, instead returning too much irrelevant content to meet the user’s information needs. In the survey, the search facilities of 50 websites were tested, but none of them received a satisfactory result. Furthermore, in the TREC-8 Small Web Task, little benefit was obtained from the use of link-based methods [1], which also demonstrates the failure of existing search technologies in small Web search.

The reasons for this failure lie in several aspects. First, it is generally accepted that ranking by keyword-based similarity faces difficulties such as shortness and ambiguity of queries. Second, link analysis technologies cannot be directly applied because the link structure of a small Web is different from that of the global Web. We will explain this difference in detail in Section 2. Third, users’ access logs could be utilized to improve search performance. However, so far few efforts have been made in this direction except DirectHit.

In this paper, we propose to generate implicit link structure based on user access pattern mining from Web logs. The link analysis algorithms then use the implicit links instead of the original explicit links to improve the search performance. The experimental results reveal that the generated implicit links contain more recommendation links than the explicit links. The search experiments on the Berkeley website illustrate that our method outperforms the keyword-based method by 16%, explicit link-based PageRank by 20% and DirectHit by 14%.

The rest of this paper is organized as follows. In Section 2, we present the basic link structure of a small Web. Section 3 analyzes user access patterns in a small Web. Section 4 describes the mining technology used to construct the implicit link structure. Section 5 discusses the ranking process. Our experimental results are presented in Section 6. In Section 7, we present related work on small Web search and link analysis. Finally, conclusions and future work are discussed in Section 8.

2. THE LINK STRUCTURE OF SMALL WEBS

Link analysis algorithms such as PageRank and HITS use eigenvector calculations to identify authoritative Web pages based on hyperlink structures. The intuition is that a page with high in-degree is highly recommended, and should have a high rank score.

Figure 1: Link structure of the Global Web and a small Web

However, there is a basic assumption underlying those link analysis algorithms: the whole Web is a citation graph (see the left plot in Figure 1), and each hyperlink represents a citation or recommendation relationship. Formally, there is the following recommendation assumption.

Recommendation assumption: a hyperlink in page X pointing to page Y stands for the recommendation of page Y by the author of page X.

Henzinger [2] also points out that there is a similar-topic assumption underlying link analysis. In this paper, we consider the similar-topic assumption to be roughly true because a small Web is often about a single domain. On the global Web, the recommendation assumption is generally correct because hyperlinks encode a considerable amount of authors’ judgment. Of course some hyperlinks are not created for the purpose of recommendation, but their influence can be reduced to a negligible level in the global Web.

However, this assumption is commonly invalid in the case of a small Web. As depicted in the right part of Figure 1, the majority of hyperlinks in a small Web are more “regular” than those in the global Web. For example, most links are from a parent node to children nodes, between sibling nodes or from leaves to the root (e.g. “Back to Home”). The reason is primarily that hyperlinks in a small Web are created by a small number of authors, and the purpose of the hyperlinks is usually to organize the content into a hierarchical or linear structure. Thus the in-degree measure does not reflect the authority of Web pages, making the existing link analysis algorithms unsuitable for this domain.

3. IMPLICIT RECOMMENDATION LINK GENERATION

The data source for generating access patterns is the Web logs collected from a small Web. We first preprocess the logs and segment them into browsing sessions. The preprocessing procedure consists of three steps: data cleaning, session identification and elimination of consecutive repetitions.

An entry in a Web log contains the following information: the timestamp of a traversal from a source page to a target page, the IP address of the originating host, the type of access (GET or POST) and other data. Since we do not care about the images or scripts embedded in the web-pages, we simply filter out the access entries for them. After cleaning the data, we distinguish users by their IP address, i.e. we assume that consecutive accesses from the same IP during a certain time interval come from the same user. In order to handle the case of multiple users sharing the same IP, we remove IP addresses whose page hit count exceeds some threshold. Afterward, the consecutive entries are grouped into browsing sessions. Different grouping criteria are modeled and compared in [5]. In this paper, we choose criteria similar to [5], i.e., a new session starts when the duration of the whole group of traversals exceeds a time threshold. Consecutive repetitions within a session are then eliminated, e.g. the session (A, A, B, C, C, C) is compacted to (A, B, C). After preprocessing, we obtain a series of user browsing sessions S = (s_1, s_2, …, s_m), where s_i = (u_i: p_i1, p_i2, …, p_ik), u_i is the user id and the p_ij are the web-pages in the browsing path.

The process of two-item sequential pattern mining is outlined below. Given a set of page-visiting transactions where each transaction is a list of web-pages ordered by access time, we find all ordered pairs of web-pages that have minimum support. After that, an implicit link is created for each two-item user access pattern (i→j), with the weight being the support of this pattern.

Existing algorithms for sequential mining like AprioriAll [4] do not fit our situation very well. For example, when we compute the frequency of the two-item candidate pattern (D→C), a long path such as (D→A→…→V→C), where the distance between D and C is large, will be taken into consideration. That is, a long path also contributes to the frequency of (D→C). According to the analysis in Berkhin et al., when the interval between two pages in a path is larger than 6, the topics of those pages may not be relevant any more. Moreover, we only need to create the two-item sequential patterns.

Thus, we propose the following algorithm. First, a gliding window is used to create two-item ordered pairs. Second, the frequency of each ordered pair is computed, and all the infrequent pairs are filtered out with a user-specified minimum support to generate the two-item sequential patterns.

In general, when a user wants to find some useful information on a website, they may follow several pages to reach the destination. According to another analysis in Berkhin et al., the average length of a session is more than 3. Based on this observation, we use a gliding window that moves over the path within a session to generate the ordered pairs. Here, the window size represents the maximum number of clicks between the source page and the target page. In Table 1, we provide an example transaction set and the corresponding two-item ordered pairs extracted by a gliding window of size 2.

We obtain all possible ordered pairs with their frequencies from a large number of user browsing sessions. Infrequent occurrences often represent random behaviors and should be removed. Precisely, the support of an item i, denoted supp(i), refers to the percentage of the sessions that contain the item i. The support of a two-item pair (i, j), denoted supp(i, j), is defined in a similar way. A two-item ordered pair is frequent if its support supp(i, j) ≥ min_supp, where min_supp is a user-specified number.

Filtering the infrequent occurrences with a frequency threshold of 3 (minimum support of 0.6), the resulting two-item sequential patterns are shown in Table 2.

Table 1: Sample Web transactions and candidate patterns

#    Transaction   Candidate patterns
T1   {ABDE}        A→B, A→D, B→D, B→E, D→E
T2   {ABECD}       A→B, A→E, B→E, B→C, E→C, E→D, C→D
T3   {ABEC}        A→B, A→E, B→E, B→C, E→C
T4   {BEBAC}       B→E, E→B, E→A, B→A, B→C, A→C
T5   {DABEC}       D→A, D→B, A→B, A→E, B→E, B→C, E→C

Table 2: Two-item sequential patterns with min_supp = 3

Ordered pair   Frequency
A→B            4
A→E            3
B→C            4
B→E            5
E→C            3
Total          19

Figure 2: Frequencies (bar chart of the frequency of each ordered pair in Table 2: A→B, A→E, B→C, B→E, E→C)

4. ACKNOWLEDGMENTS

This work was performed at Microsoft Research Asia. Many thanks to Jian-Yun Nie for his very helpful advice.

5. REFERENCES

CONCLUSION AND FUTURE WORK

In this paper, we proposed a novel small Web search model based on mining the users’ browsing behaviors in a small Web. First, a two-item sequential pattern mining algorithm is applied to find the frequent user access patterns in the Web log. Then, the mined access patterns are treated as implicit links between the web-pages within the small Web and replace the original explicit links. Furthermore, the weights of the implicit links in the adjacency matrix of the website are assigned by the support of the mined access patterns instead of a simple 0 or 1. Finally, the PageRank algorithm is applied to the adjacency matrix to calculate the rank score of each web-page, and this rank score is interpolated with the similarity from full-text search to rerank the search results. Our experimental results showed that the proposed method outperforms the existing solutions. In other words, the mined access patterns represent latent relationships among the web-pages that provide better information for recommending a web-page. This is crucial to the success of the method, because this kind of link obeys the recommendation assumption of the PageRank algorithm.

Since PageRank is a query-independent algorithm, in the next step we hope to adapt our algorithm to rank the pages relevant to a given topic.

Another possible direction is to apply our mined implicit links to improve the accuracy of web-page clustering. Existing solutions for clustering web-pages are based on the content and explicit links of web-pages. As we noted earlier, the explicit links in a specific website serve only to organize content, so it is difficult to achieve good clustering results with this kind of link. We plan to combine our mined implicit links with the contents of web-pages to cluster web-pages effectively.

Furthermore, to improve the effectiveness of a website, we may discover the gap between the website designer’s expectation and the visitors’ behavior by comparing the importance of web-pages calculated from explicit links with that calculated from implicit links, and then suggest a modification to the site structure to make it more effective for browsing and navigation.
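The pipeline summarized in this section can be illustrated on the sample transactions of Table 1. The following is a minimal Python sketch, not the authors' implementation: the function names (`candidate_pairs`, `frequent_pairs`, `weighted_pagerank`), the damping factor of 0.85, the normalization of each link weight by the total outgoing weight of its source page, and the uniform redistribution of rank from pages without outgoing links are all assumptions made for illustration. The paper's modified PageRank is not specified in this text. The sketch extracts ordered pairs with a gliding window of size 2, keeps pairs occurring in at least 3 sessions, and runs a support-weighted PageRank over the resulting implicit links.

```python
from collections import Counter


def candidate_pairs(session, window=2):
    """Ordered pairs (src, dst) with dst at most `window` clicks after src."""
    pairs = set()
    for i, src in enumerate(session):
        for dst in session[i + 1 : i + 1 + window]:
            if src != dst:  # skip self-pairs such as B->B in T4
                pairs.add((src, dst))
    return pairs


def frequent_pairs(sessions, window=2, min_count=3):
    """Count in how many sessions each pair occurs; keep the frequent ones."""
    counts = Counter()
    for session in sessions:
        counts.update(candidate_pairs(session, window))
    return {pair: n for pair, n in counts.items() if n >= min_count}


def weighted_pagerank(links, damping=0.85, iterations=50):
    """Power iteration over implicit links weighted by pattern frequency.

    Assumptions of this sketch: weights are normalized by the source page's
    total outgoing weight, and dangling pages spread their rank uniformly.
    """
    pages = sorted({page for pair in links for page in pair})
    n = len(pages)
    out_weight = {page: 0.0 for page in pages}
    for (src, _dst), weight in links.items():
        out_weight[src] += weight
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if out_weight[p] == 0)
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for (src, dst), weight in links.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Transactions T1..T5 from Table 1, one character per page.
sessions = ["ABDE", "ABECD", "ABEC", "BEBAC", "DABEC"]
patterns = frequent_pairs(sessions, window=2, min_count=3)
ranks = weighted_pagerank(patterns)
```

With these inputs, `patterns` contains exactly the five ordered pairs and frequencies listed in Table 2. Interpolating `ranks` with a full-text similarity score, as described in the conclusion, is left out because the interpolation weights are not given here.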
List of Tables
Table 1: Sample Web transactions and candidate patterns
Table 2: Two-item sequential patterns with min_supp = 3

List of Figures
Figure 1: Link structure of the Global Web and a small Web
Figure 2: Frequencies

Contents
Abstract
1. INTRODUCTION
2. THE LINK STRUCTURE OF SMALL WEBS
3. IMPLICIT RECOMMENDATION LINK GENERATION
4. ACKNOWLEDGMENTS
5. REFERENCES
