
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 12, DECEMBER 2010, ISSN 2151-9617


Rank Optimization and Query Recommendation in Search Engines using Web Log Mining Techniques

Neelam Duhan, A. K. Sharma

Abstract—Query logs are important information repositories, which record user activities on the search results. Mining these logs can improve the performance of search engines. Search engines generally return long lists of ranked pages, and finding the desired information in such a list is tedious for the user; search result optimization techniques therefore come into play. The proposed system, based on learning from query logs, predicts user information needs and reduces the user's seek time within the search result list. To achieve this, the method first mines the logs using a novel similarity function to perform query clustering and then discovers the sequential order of clicked URLs in each cluster using a modified version of an existing sequential pattern mining algorithm. Finally, the search result list is optimized by re-ranking the pages using the discovered sequential patterns. The proposed system proves to be efficient, as the relevant pages desired by the user occupy earlier positions in the result list, thereby reducing the search space. The paper also presents a query recommendation scheme towards better information retrieval.

Index Terms—World Wide Web, Information Retrieval, Query Log, Web Usage Mining, Ranking Algorithm.

——————————  ——————————

1 INTRODUCTION

WITH the popularity of the internet, the information on the World Wide Web [1] is growing dramatically. Search engines are information retrieval systems that help users locate needed information by posing queries. Users express their queries on the interface of a search engine through a combination of keywords. Initially, search engines used traditional information retrieval techniques, where a keyword-based similarity function between the query and the documents was used to identify the required documents. This approach was prone to poor quality of search results. Recent research has proposed ranking methodologies [2, 3] that use the linkage structure of the web, instead of the content, to improve search result quality.

In spite of the recent advances in Web search engine [1] technologies, there are still many situations in which the user is presented with undesired and non-relevant pages in the topmost results of the ranked list. One of the major reasons for this problem is the lack of user knowledge in framing queries. Moreover, search engines often have difficulties in forming a concise and precise representation of the response pages corresponding to a user query. Nowadays, providing a set of web pages based on user query words is not a big problem for search engines; rather, the problem appears at the user side, as the user has to sift through a long result list to find the desired content. This problem is referred to as the Information Overkill problem [4]. Although many search engines apply ranking, clustering and other web mining methodologies to optimize their search results, there still remains a challenge in providing the required content with less navigation overhead.

Search engines must have a mechanism to find the users' interests with respect to their queries and then optimize the results correspondingly. To achieve this, the query log files maintained by search engines can play an important role. The logs provide an excellent opportunity for gaining insight into how a search engine is used and what the users' interests are. Today, many Web applications apply Web usage mining techniques to predict users' navigational behavior by automatically discovering access patterns from one or more log files, but none have used them for search engine result optimization. In this paper, the purpose of web log mining is to improve the performance of the search engine by utilizing the mined knowledge.

A novel approach for result optimization and query recommendation is proposed in this paper, which attempts to optimize the search engine's results by improving their page ranks and thus increasing the relevancy of the pages according to users' feedback. The approach also recommends to the user a set of similar and most popular queries so as to make the next search more efficient than the previous one. To perform the required task, the approach pre-mines the query logs to retrieve potential clusters of queries and then finds the most popular queries in each cluster. The entries of each cluster are again mined to extract sequential patterns of the pages accessed by the users. The outputs of both mining processes are utilized to return relevant pages to the user as well as to recommend popular historical queries.

The paper has been organized as follows: Section 2 describes the basic terminologies forming the basis of the proposed work and the related literature. Section 3 explains the proposed optimization system along with examples. The experimental evaluation of the work is given in Section 4, and Section 5 concludes the paper with some discussion on future research.

————————————————
• Neelam Duhan is with the Department of Computer Engg, YMCA University of Science & Technology, Faridabad, India.
• Dr. A. K. Sharma is with the Department of Computer Engg, YMCA University of Science & Technology, Faridabad, India.

2 PRELIMINARIES AND RELATED WORK

2.1 Query Logs
The notion of Web Usage/Log Mining has been a subject of interest for many years. In the context of search engines, servers record an entry in the log for every single access they get corresponding to a query. Thus, the log mainly contains users' queries and the corresponding clicked URLs, as well as other information about their browsing activities. The typical logs [5] of search engines include the following entries: 1) the user (session) ID, 2) the query q issued by the user, 3) the URL u accessed/clicked by the user, 4) the rank r of the URL u clicked for the query q, and 5) the time t at which the query was submitted. An example illustration of a query log is shown in Table 1.

TABLE 1
A SAMPLE CLICKTHROUGH DATA FOR THE QUERY LOG

UserID | Query | Clicked URL | r | Time
27098 | Data mining | datamining/typepad.com | 6 | 2009-03-01 00:01:10
27098 | Data mining | dataoutsourcingindia.com | 5 | 2009-03-01 00:01:10
48785 | Database | webopedia.com/database.html | 5 | 2009-03-01 00:01:16
48785 | Database | en.wikipedia.org/wiki/database | 7 | 2009-03-01 00:01:16
48785 | Database | http://www.cmie.com | 6 | 2009-03-01 00:01:16
25112 | Web mining | searchcrm.techtarget.com/definition/Web-mining | 5 | 2009-03-01 00:01:40
25112 | Web mining | cse.iitb.ac.in/~soumen/mining-the-web/ | 4 | 2009-03-01 00:01:40
... | ... | ... | ... | ...

A number of researchers have discussed the problem of analyzing these query logs [6, 7]. The information contained in query logs has been used in many different ways [8, 9], for example to provide context during search, to classify queries, to infer search intent, and to facilitate personalization. In various studies, researchers and search engine administrators have used information from query logs to learn about the search process and to improve search engines [10, 11]. Besides learning about search engines or their users, query logs are also being used to infer semantic concepts or relations [5].

2.2 Page Ranking
The majority of search engines return a ranked representation of their search results. For ranking the pages, various algorithms have been introduced in the literature, e.g. PageRank [12, 13] and Hypertext Induced Topic Selection (HITS) [2], which are based on link-oriented approaches [14, 15]. The rank score is calculated by some sophisticated approach and is independent of users' query words. Therefore, the relation between web pages and the requirement of a user could not be matched [16]. In this manner, the web pages most relevant to the users' query words will not be shown at the top of the search result list. Google uses the PageRank algorithm to calculate the rank score and combines it with the text matching scores of the pages at query time to return the important and relevant pages. The PageRank formula can be defined as:

PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v}    (1)

where u represents a web page, B(u) is the set of pages that point to u, PR(u) and PR(v) are the rank scores of pages u and v respectively, N_v denotes the number of outgoing links of page v, and d is the damping factor, usually set to 0.85.
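As an illustration of how Eq. (1) behaves, the following is a minimal Python sketch of the PageRank iteration; the toy link graph, the iteration count and the uniform initial scores are assumptions of this example, not part of the original system.

```python
# A minimal sketch of the PageRank iteration of Eq. (1); the toy link graph,
# iteration count and uniform initial scores are illustrative assumptions.
def pagerank(out_links, d=0.85, iterations=50):
    """out_links maps each page to the list of pages it links to."""
    pages = set(out_links) | {p for targets in out_links.values() for p in targets}
    in_links = {p: [] for p in pages}          # B(u): pages pointing to u
    for v, targets in out_links.items():
        for u in targets:
            in_links[u].append(v)
    pr = {p: 1.0 for p in pages}               # initial rank scores
    for _ in range(iterations):
        pr = {u: (1 - d) + d * sum(pr[v] / len(out_links[v]) for v in in_links[u])
              for u in pages}
    return pr

# A tiny three-page web graph.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```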

2.3 Sequential Pattern Detection
The problem of mining sequential patterns of web pages has been an active area of research, as it helps in finding the order in which web pages get visited by users. AprioriAll [17, 18] finds all patterns, but it is a three-phase algorithm: it first finds all itemsets with minimum support, then transforms the database so that each transaction is replaced by the set of all frequent itemsets contained in it, and finally finds the sequential patterns. Empirical evaluation by the researchers indicates that GSP [17, 19] is much faster than the AprioriAll algorithm.

A critical look at the available literature indicates that modern-day search engines apply different optimization measures [20] to their search results in one way or another so as to better satisfy user needs, but the user is still presented with either irrelevant search results or the problem of finding the required information within the search results. The proposed method gives prime importance to the information needs of the user and optimizes the rank values of the returned web pages according to the history of pages accessed by the users.

3 PROPOSED OPTIMIZATION SYSTEM

The proposed optimization method dynamically predicts user information needs from historical query logs and, based on these predictions, optimizes the user search by recommending the most favored query related to the search and returning the desired relevant pages at the top of the search result list. The proposed system works in the following four steps:

Step 1: The main aspect of the system is to perform query clustering by finding the similarities between pair-wise queries, based on query keywords and user browsing behavior. Similarity based on browsing behavior is calculated by means of a proposed graphical technique.
Step 2: The most popular/favored queries are discovered within each query cluster.
Step 3: The next step taken by the system is to find the sequential patterns of web pages visited by the users within each cluster, with the help of the modified GSP (Generalized Sequential Patterns) algorithm [17].
Step 4: The final step is to optimize the user's web search based on the outputs of the previous two steps.

Optimization in Step 4 is carried out by re-ranking the search result list, i.e. by modifying the already assigned rank scores of the web pages using the discovered sequential patterns. The rank updation improves the relevancy of a web page based on its access history. Step 2 is used to recommend to the user the most famous query, along with many similar queries, for a better search.

3.1 Proposed Architecture
The architecture of the proposed system is shown in Fig. 1 and consists of the following functional components:
1. Similarity Analyzer
2. Query Clustering Tool
3. Favored Query Finder
4. Sequential Pattern Generator
5. Rank Updater
6. Query Recommender

[Fig. 1. Architecture of Proposed Optimization System — the search engine interface, Query Processor, Index and Query Log, together with the Similarity Analyzer, Query Clustering Tool, Favored Query Finder, Sequential Pattern Generator, Rank Updater, Query Recommender and the Query Cluster Database, and the data (queries, matched pages, similarity values, query clusters, sequential patterns, optimized ranked pages and query recommendations) flowing between them.]

When a user submits a query on the search engine interface, the Query Processor component matches the query terms with the index repository of the search engine and returns a list of matched documents in response. On the back end, the user browsing behavior, including the submitted queries and clicked URLs, is stored in the logs and analyzed continuously by the Similarity Analyzer module, the output of which is forwarded to the Query Clustering Tool to generate groups of queries based on their similarities. The Favored Query Finder extracts the most popular queries from each cluster and stores them for future reference. The Pattern Generator module discovers sequential patterns of web pages in each cluster. The Rank Updater component works online and takes as input the matched documents retrieved by the Query Processor. It improves the ranks of pages according to the sequential patterns which were discovered offline. The Query Recommender guides the user with similar queries, with the most famous query highlighted.

The working and algorithms of the different functional modules are explained in the next subsections.

3.2 Query Similarity Analyzer
This approach is based on two principles: similarity based on the query keywords and similarity based on cross-references (user feedback). These principles are formulated below.

3.2.1 Similarity Based on Query Keywords
If two user queries contain the same or similar terms, they denote the same or similar information needs. The following formula is used to measure the content similarity between two queries:

Sim_{keyword}(p, q) = \frac{|KW(p, q)|}{\max(|kw(p)|, |kw(q)|)}    (2)

where kw(p) and kw(q) are the sets of keywords in the queries p and q respectively, and KW(p, q) is the set of keywords common to the two queries.
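A minimal Python sketch of Eq. (2) is given below; the lower-cased whitespace tokenization of the query strings is an assumption of this example.

```python
# A minimal sketch of the keyword similarity of Eq. (2); splitting queries on
# whitespace after lower-casing is an assumed tokenization.
def sim_keyword(p: str, q: str) -> float:
    kw_p, kw_q = set(p.lower().split()), set(q.lower().split())
    common = kw_p & kw_q                                  # KW(p, q)
    return len(common) / max(len(kw_p), len(kw_q))

print(sim_keyword("Maruti Swift Price", "Maruti Swift Dzire"))   # 2/3 = 0.667
```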

3.2.2 Similarity Based on User Feedback
Two queries are considered to be similar if they share, or result in the selection of, the same or similar documents. This principle is based on Beeferman and Berger's agglomerative clustering algorithm [6]. The approach is content-ignorant, which means that the algorithm does not make use of the actual content of the queries and the documents in clustering. The Similarity Analyzer first constructs a bipartite graph with one set of vertices corresponding to queries and the other corresponding to documents, as shown in Fig. 2.

[Fig. 2. A Bipartite Graph of a Query Log — query vertices Q1–Q4 on one side and document vertices D1–D7 on the other, with each edge labeled by the number of clicks on that document for that query.]

A query vertex is joined with a document vertex if the document has been accessed/clicked by a user corresponding to the said query. The integer on each edge gives the number of accesses to the document by distinct users for a particular query. For example, a value of 10 between Q1 and D1 says that 10 users have clicked on D1 corresponding to Q1. In the figure, D1, D2 and D4 are accessed with respect to Q1 and thus are relevant to Q1, while D2, D3 and D4 are relevant to Q2, and so on. As Q1 and Q2 share the two documents D2 and D4, they can be considered similar, but the similarity is decided on the basis of the number of document clicks.

If two queries p and q share a common document d, then the similarity value is the ratio of the total number of distinct clicks on d with respect to both queries to the total number of distinct clicks on all the documents accessed for both queries. If more than one document is shared, then the numerator is obtained by summing up the document clicks of all the common documents. The following formula dictates the similarity function based on document clicks:

Sim_{clickURL}(p, q) = \frac{\sum_{d_i \in CD(p) \cap CD(q)} \big( LC(p, d_i) + LC(q, d_i) \big)}{\sum_{x_i \in CD(p) \cup CD(q)} \big( LC(p, x_i) + LC(q, x_i) \big)}    (3)

where LC(p, d) and LC(q, d) are the numbers of clicks on document d corresponding to queries p and q respectively, and CD(p) and CD(q) are the sets of clicked documents corresponding to queries p and q respectively.

As an example illustration, Q1 and Q2 share the two common documents D2 and D4, while D1, D2, D3 and D4 are accessed by Q1 or Q2 or both. The similarity between the two queries is:

Sim_{clickURL}(Q1, Q2) = \frac{(100 + 10) + (1000 + 100)}{(10 + 0) + (100 + 10) + (0 + 10) + (1000 + 100)} = 0.984

Similarly, the similarity between Q1 and Q3 is:

Sim_{clickURL}(Q1, Q3) = \frac{1010}{2220} = 0.455

The similarity values always lie between 0 and 1. The measure given in (3) declares two queries similar by imposing a threshold value on their similarity value.
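The following Python sketch mirrors Eq. (3); representing each query's feedback as a dictionary from clicked documents to click counts is an assumption of this example, with the edge weights taken from the Q1/Q2 example of Fig. 2.

```python
# A minimal sketch of the click-based similarity of Eq. (3). Each query is
# represented by a dict mapping a clicked document to its click count LC.
def sim_click(clicks_p: dict, clicks_q: dict) -> float:
    common = set(clicks_p) & set(clicks_q)                 # CD(p) ∩ CD(q)
    union = set(clicks_p) | set(clicks_q)                  # CD(p) ∪ CD(q)
    num = sum(clicks_p[d] + clicks_q[d] for d in common)
    den = sum(clicks_p.get(x, 0) + clicks_q.get(x, 0) for x in union)
    return num / den if den else 0.0

# Edge weights of the Q1/Q2 example from Fig. 2.
q1_clicks = {"D1": 10, "D2": 100, "D4": 1000}
q2_clicks = {"D2": 10, "D3": 10, "D4": 100}
print(sim_click(q1_clicks, q2_clicks))                     # ≈ 0.984
```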

3.2.3 Combined Similarity Measure
The two criteria have their own advantages. Using the first criterion, queries of similar composition can be grouped together; using the second criterion, benefit can be taken from users' judgments. Both the query keywords and the corresponding document clicks can only partially capture the users' interests when considered separately. Therefore, it is better to combine them in a single measure. A simple way to do this is to combine both measures linearly as follows:

Sim_{combined}(p, q) = \alpha \cdot Sim_{keyword}(p, q) + \beta \cdot Sim_{clickURL}(p, q)    (4)

where \alpha and \beta are constants with 0 \le \alpha, \beta \le 1 and \alpha + \beta = 1. The values of the constants can be decided by expert analysts depending on the importance given to the two similarity measures. In the current implementation, both parameters are taken to be 0.5.
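A one-line Python sketch of Eq. (4), reusing the sim_keyword and sim_click sketches above with the paper's setting of α = β = 0.5:

```python
# A minimal sketch of the combined measure of Eq. (4), with α = β = 0.5 as in
# the current implementation; sim_keyword and sim_click are the sketches above.
def sim_combined(p, q, clicks_p, clicks_q, alpha=0.5, beta=0.5):
    return alpha * sim_keyword(p, q) + beta * sim_click(clicks_p, clicks_q)
```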

3.3 Query Clustering Tool


Let us take an example of query sessions from a query
Query Clusters represent clearly defined intentions of
log, which is shown in Table 2. For calculating the query
search engine users. For obtaining these clusters, the
similarity, any one of the measures (2, 3) or the combined
Query Clustering module uses the algorithm shown in
measure (4) can be utilized.
Fig. 3, where each run of the algorithm computes k clus-
The three cases given below describe the clusters ob-
ters. As query logs are dynamic in nature, query cluster-
tained by using different measures:
ing algorithm should be incremental in nature.
Case 1: If the keyword-based measure (2) with threshold
The algorithm is based on the simple perspective: in-
value 0.6 is applied, the queries are divided into 3 clusters
itially, all queries are considered to be unassigned to any
C1, C2 and C3, where:
cluster. Each query is examined against all other queries
C1: Query 1
(whether classified or unclassified) by using (4). If the
C2: Query 2
similarity value turns out to be above the pre-specified
C3: Query 3 and query 4
threshold value ( ), then the queries are grouped into the
Here, queries 1 and 2 are not grouped together.
same cluster. The same process is repeated until all que-
Case 2: If clicked_documents based measure (3) is applied
ries get classified to any one of the clusters. The algorithm
with threshold value set to 0.7, the clusters formed are:
returns overlapped clusters i.e. a single query may span
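Under the assumption that the query log has already been aggregated into a query-to-clicks dictionary, the clustering loop of Fig. 3 can be sketched in Python as follows; the function reuses the sim_combined sketch from Section 3.2.3 and, like the algorithm, may return overlapping clusters.

```python
# A sketch of the O(n^2) clustering loop of Fig. 3: every query seeds a cluster
# and every other query whose combined similarity reaches the threshold θ is
# added to it; clusters may overlap. `clicks` maps each query to its click dict.
def query_clustering(queries, clicks, theta=0.5):
    clusters = []
    for p in queries:
        cluster = {p}
        for q in queries:
            if q != p and sim_combined(p, q, clicks[p], clicks[q]) >= theta:
                cluster.add(q)
        clusters.append(cluster)
    return clusters
```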

Let us take as an example the query sessions from a query log shown in Table 2. For calculating the query similarity, any one of the measures (2) and (3) or the combined measure (4) can be utilized.

TABLE 2
AN EXAMPLE QUERY LOG FOR QUERY CLUSTERING

Sr. No. | Query | UserId | Clicked URL
1 | Data Mining | 1571911 | http://www.A
  |             | 1571120 | http://www.B
2 | Data Warehousing | 1572790 | http://www.B
  |                  | 1572300 | http://www.A
3 | Data Base System | 1571100 | http://www.P
  |                  | 1571911 | http://www.Q
  |                  | 1573421 | http://www.R
4 | Data Base Management System | 1571100 | http://www.A
  |                             | 1571202 | http://www.Q
... | ... | ... | ...

The three cases given below describe the clusters obtained by using the different measures.

Case 1: If the keyword-based measure (2) with threshold value 0.6 is applied, the queries are divided into three clusters C1, C2 and C3, where:
C1: query 1
C2: query 2
C3: query 3 and query 4
Here, queries 1 and 2 are not grouped together.

Case 2: If the clicked-documents based measure (3) is applied with the threshold value set to 0.7, the clusters formed are:
C1: query 1 and query 2
C2: query 3
C3: query 4
In this case, queries 3 and 4 are not found to be similar.

Case 3: Now let us use the combined measure (4), where α and β are set to 0.5 and the similarity threshold θ is also set to 0.5. The queries are now clustered in the desired way:
C1: query 1 and query 2
C2: query 3 and query 4

By analyzing the results obtained through the different approaches, it is determined that the combination of the keyword-based and the clicked-documents based approach is more appropriate for query clustering.

3.4 Favored Query Finder
Once the query clusters are formed, the next step is to find a set of favored/popular queries from each cluster. A query is said to be favored if it occupies a major portion of the search requests in a cluster; it is assumed that, in general, this is the query submitted by most of the users. The process of finding favored queries is shown in Fig. 4, which finds the favored query (or queries) in one cluster. The process is applied to all the clusters and the output is stored again in the Query Cluster Database in the form of <ClusterId, favored query> pairs.

Algorithm: Favored_Query_Finder( )
I/P: A cluster of queries.
O/P: True or False for each query.

// Start of Algorithm
1. Club together the queries which are exactly the same and make a set of <query, IP addresses> pairs.
2. For (each q ∈ Cluster)
       Calculate the weight of the query as:
           Wt = (No. of IP addresses which fired the query) / (Total no. of IP addresses in that cluster)
       If (Wt ≥ threshold value) then
           Return True;      // query is considered a favored query
       else
           Return False;     // query is considered disfavored

Fig. 4. Algorithm for Favored Query Finder.
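A small Python sketch of the favored-query test of Fig. 4 follows; representing a cluster as a mapping from each query to the set of IP addresses that fired it, and the example threshold of 0.3, are assumptions of this illustration.

```python
# A sketch of the favored-query test of Fig. 4: a query's weight is the share
# of distinct IP addresses in the cluster that fired it, and queries whose
# weight reaches the threshold are reported as favored. The cluster format
# (query -> set of IPs) and the 0.3 threshold are assumptions of this example.
def favored_queries(cluster, threshold=0.3):
    all_ips = set().union(*cluster.values())
    return [q for q, ips in cluster.items()
            if len(ips) / len(all_ips) >= threshold]
```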

3.5 Sequential Pattern Generator
This component takes the query clusters as input and finds the sequential patterns in each cluster. To achieve this, a modified version of the GSP algorithm [17] is called, which is shown in Fig. 5. Mod_GSP() makes multiple passes over the URL sequences stored in each query cluster. The set of queries in a particular cluster constitutes the session set, and there is a sequence of clicked URLs corresponding to each session (query). Given the set of frequent (n-1)-length patterns of URLs, the candidate set for the next generation is generated from the input set according to the minimum support threshold.

Algorithm: Mod_GSP(QC, min_sup)
Given: A query cluster QC with a set of URLs/pages;
       threshold value min_sup   // for pruning the candidates
Output: Set of frequent sequential patterns

// Start of Algorithm
Result = ∅;                                     // result set of all sequential patterns
P1 = set of frequent 1-length page sequences;   // pages whose support ≥ min_sup
k = 2;
While (P(k-1) != Null)
{
    Generate the candidate set Ck;              // set of candidate k-sequences
    For (all page sequences S in the cluster QC)
        Increment the count of every c ∈ Ck that S supports;
    Pk = {c ∈ Ck | sup(c) ≥ min_sup};           // pruning phase
    Result = Result ∪ Pk;
    k = k + 1;
} // end while
Return Result.

Fig. 5. Algorithm for Modified GSP for finding Page Sequences.

Only the frequent patterns in the current set are considered for generating the next candidate sequences. For candidate generation, the algorithm merges pairs of frequent subsequences found in the (k-1)th pass to generate candidate k-length sequences. A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2. For example, merging the page sequences w1 = <{C} {D E} {F}> and w2 = <{D E} {F G}> will produce the candidate page sequence <{C} {D E} {F G}>, because the last two events in w2 (F and G) belong to the same element.

A pruning phase eliminates the infrequent candidate patterns. For all candidate patterns of length k, all URL sequences are processed once and the count of each detected candidate is incremented. At each iteration, the candidate k-sequences whose support is less than min_sup are eliminated by the module.

An example showing the pattern generation for web pages (say A, B, C, D and E) is given in Fig. 6 to better understand the algorithm; the figure shows the final iteration of the candidate generation and pruning phases.

[Fig. 6. An Example of Sequential Pattern Generation — from the frequent 3-sequences <{A}{B}{C}>, <{A}{B E}>, <{A}{E}{C}>, <{B}{C}{D}>, <{B E}{C}>, <{C}{D}{E}> and <{E}{C D}>, candidate generation produces the candidate 4-sequences <{A}{B}{C}{D}>, <{A}{B E}{C}>, <{A}{E}{C D}>, <{B}{C}{D}{E}> and <{B E}{C D}>, of which only <{A}{B E}{C}> survives candidate pruning.]
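The support-counting and pruning step at the heart of Mod_GSP can be sketched in Python as below. For brevity the sketch treats each session and each candidate as a flat list of pages, ignoring the element structure of full GSP (e.g. {B E}); the example sessions and min_sup = 2 are assumptions.

```python
# A simplified sketch of Mod_GSP's counting/pruning step: a candidate survives
# if it occurs as a (not necessarily contiguous) subsequence of at least
# min_sup of the cluster's clicked-URL sequences. Element structure is omitted.
def is_subsequence(candidate, session):
    it = iter(session)
    return all(page in it for page in candidate)   # consumes `it` left to right

def prune_candidates(candidates, sessions, min_sup=2):
    return [c for c in candidates
            if sum(is_subsequence(c, s) for s in sessions) >= min_sup]

sessions = [["D", "B", "E", "A"], ["D", "E", "A"], ["B", "D", "E", "A"]]
print(prune_candidates([["D", "E", "A"], ["A", "D"]], sessions))  # [['D', 'E', 'A']]
```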

3.6 Rank Updater
This module takes as input from the Query Processor the matched documents of the user query and updates the rank scores of these pages based on the detected sequential patterns. The module operates online at query time. Only those documents are considered for rank update which are most frequently accessed by users and appear in at least one of the sequential patterns. The updater works in the following steps:

Step 1: Given an input user query q and the set of matched documents D returned by the Query Processor, the cluster C to which the query q matches is found. This can be done by matching the query keywords with the set of keywords of every cluster.
Step 2: The sequential patterns of the concerned cluster are retrieved from the local repository maintained by the Sequential Pattern Generator.
Step 3: A pattern weight (described below) is calculated for every page d, where d belongs to the pages in the set of sequential patterns.
Step 4: The final rank of a page d ∈ D is computed if it is present in the set of sequential patterns of cluster C. The improved rank is calculated as the summation of the previous rank and the assigned weight value.

By improving the rank on the basis of user feedback, the results of a search engine can be optimized so as to better serve user needs. As a result, the user can now find the popular and relevant pages higher up in the result list. The methods to find the weight and the improved rank of a page are given below.

3.6.1 Weight Calculation
The weight of a URL is estimated from its popularity and order of access corresponding to the user query. Suppose a sequential pattern <{A}{B E}{C}> belongs to the query cluster matched with the user query. This pattern can be presented graphically as shown in Fig. 7: the page (more precisely, the URL) A lies at level 1, pages B and E are on level 2, while C is at level 3.

[Fig. 7. Pictorial Representation of a Sequential Pattern — A at level 1, B and E at level 2, C at level 3.]

The weight of a page X is inversely proportional to its position in the sequential pattern and is calculated as:

Weight(X) = \frac{\ln(len_{pat}(X))}{level(X)}    (5)

where len_pat(X) is the effective length/depth of the sequential pattern in which X occurs and level(X) is the depth of X in the pattern. Considering the example pattern of Fig. 7, the weights of the pages come out to be:

Weight(A) = ln(3)/1 = 1.099
Weight(B) = ln(3)/2 = 0.549, etc.

3.6.2 Rank Improvement
The rank of a page can be improved with the help of its assigned weight. The new rank becomes:

New_Rank(X) = Rank(X) + Weight(X)    (6)

where Rank(X) is the existing rank value (PageRank) of page X and Weight(X) is the popularity given to X.
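Eqs. (5) and (6) can be sketched together in Python as follows; representing a sequential pattern as a list of levels (each a set of pages) and the example old rank are assumptions of this illustration.

```python
import math

# A minimal sketch of Eqs. (5) and (6): a page's weight is ln(pattern length)
# divided by its level in the matched sequential pattern, and the new rank adds
# that weight to the existing rank score. A pattern is a list of page sets.
def pattern_weights(pattern):
    length = len(pattern)                                  # len_pat(X)
    return {page: math.log(length) / level
            for level, pages in enumerate(pattern, start=1)
            for page in pages}

def new_rank(old_rank, weight):
    return old_rank + weight                               # Eq. (6)

weights = pattern_weights([{"A"}, {"B", "E"}, {"C"}])
print(round(weights["A"], 3), round(weights["B"], 3), round(weights["C"], 3))
# 1.099 0.549 0.366
print(new_rank(4, weights["A"]))                           # e.g. 5.099
```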
3.7 Query Recommender
This component provides the user with a set of recommended queries, with the most popular query highlighted. The recommended queries are the queries which are similar to the query submitted by the user and are thus contained in the cluster of that query. For example, the recommendations for the query Data Base could be:

Data Base Environment
Data Base Management System
Data Base System
Data Repository
Basics of Data Base

The recommended queries are sorted, with the popular query being highlighted. When a user submits a query, its keywords are matched in the Query Cluster Database and the queries in the matched cluster are output by the Query Recommender on the interface of the search engine. The user can continue with the same query or choose any one of the recommendations.
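A sketch of this recommendation step is given below; the cluster-store format (cluster id to query list, plus a favored-query map) and the keyword-overlap matching rule are assumptions of this example, not details given in the paper.

```python
# A sketch of the Query Recommender: the submitted query's keywords are matched
# against the stored clusters, and the matched cluster's queries are returned
# with the favored (most popular) query first. The cluster-store format and the
# keyword-overlap matching rule are assumptions of this example.
def recommend(user_query, clusters, favored):
    """clusters: ClusterId -> list of queries; favored: ClusterId -> favored query."""
    words = set(user_query.lower().split())
    def overlap(cid):
        return len(words & {w for q in clusters[cid] for w in q.lower().split()})
    best = max(clusters, key=overlap)
    return sorted(clusters[best], key=lambda q: q != favored.get(best))
```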
4 EXPERIMENTAL RESULTS

To show the validity of the proposed architecture, a fragment of a sample query log is considered (given in Table 3). Because the actual number of queries is too large to conduct a detailed evaluation, only 14 query sessions are chosen for the present illustration. The following functions are tested on the 14 query sessions:
1. Keyword similarity (Sim_keyword)
2. Similarity using document clicks (Sim_click)
3. Similarity using both keywords and document clicks (Sim_combined)
4. Query clustering
5. Modified GSP algorithm
6. Rank updation

TABLE 3
SAMPLE FRAGMENT OF LOG FOR PRACTICAL EVALUATION

UserId | Query | Clicked URL | Click_Count
1220051 | Maruti Swift Price | www.marutiswift.com | 5
        |                    | www.gaadi.com | 10
        |                    | www.marutidzire.com | 3
1220051 | Maruti Swift Dzire | www.marutidzire.com | 8
        |                    | www.cardekho.com | 3
1220051 | Maruti Swift | www.marutiswift.com | 5
        |              | www.gaadi.com | 6
        |              | www.carwale.com | 25
1220051 | Maruti Swift Dzire Price | www.marutiswift.com | 5
        |                          | www.marutidzire.com | 5
        |                          | www.carwale.com | 8
        |                          | www.cardekho.com | 10
1220051 | Ray Ban sunglasses | www.ray-ban.com | 10
        |                    | www.hisunglasses.com | 12
1220051 | Ray Ban sunglasses India | www.ray-ban.com | 15
        |                          | www.emporiumonet.com | 2
1220052 | Maruti Swift Price | www.marutiswift.com | 14
        |                    | www.gaadi.com | 10
        |                    | www.carwale.com | 23
1220052 | Maruti Swift | auto.indiacar.com | 21
1220052 | Ray Ban sunglasses India price | www.eyeweartown.com | 7
        |                                | www.ray-ban.com | 9
        |                                | www.apparell.shop.ebay.in | 12
1220053 | Maruti Swift Price | www.marutiswift.com | 11
        |                    | www.gaadi.com | 10
        |                    | www.carwale.com | 2
1220054 | Maruti Swift Dzire Price | www.cardekho.com | 9
        |                          | www.marutisuzuki.com | 14
        |                          | www.marutitruevalue.com | 15
1220054 | Maruti Swift | www.gaadi.com | 11
        |              | www.carwale.com | 10
1220054 | Maruti Swift Price | www.marutiswift.com | 11
        |                    | www.marutisuzuki.com | 12
        |                    | www.marutitruevalue.com | 5
1220054 | Ray Ban sunglasses | www.ray-ban.com | 5
        |                    | www.apparell.shop.ebay.in | 6
        |                    | www.emporiumonet.com | 7
... | ... | ... | ...

For the third and fifth functions, α, β and θ are set to 0.5, and min_sup is taken to be 2.
4.1 Similarity and Clustering Calculations
Suppose we want to calculate the similarity between the first two queries. Let us say q1 = Maruti Swift Price and q2 = Maruti Swift Dzire. Then:

Sim_keyword(q1, q2) = 2/3
Sim_click(q1, q2) = 11/29 = 0.379
Sim_combined(q1, q2) = (0.5)(2/3) + (0.5)(11/29) = 0.5228

Since the value of θ is 0.5, both queries are grouped into the same cluster C1 and stored in the Query Cluster Database with the keywords {maruti, swift, dzire, price} and the set of URLs {www.marutidzire.com, marutiswift.com, gaadi.com, cardekho.com}. In this way, the similarities for the other queries can be determined and the clustering can be performed. The final clusters obtained are C1 = {q1, q2, q3, q4, q7, q8, q10, q11, q12, q13} and C2 = {q5, q6, q9, q14}. By increasing the value of θ, more precise clusters can be found.
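These numbers can be reproduced with the similarity sketches from Section 3.2, using the click counts of the first two sessions of Table 3:

```python
# Reproducing the Section 4.1 figures with the sim_keyword / sim_click /
# sim_combined sketches from Section 3.2 (clicks taken from Table 3).
q1, q2 = "Maruti Swift Price", "Maruti Swift Dzire"
clicks_q1 = {"www.marutiswift.com": 5, "www.gaadi.com": 10, "www.marutidzire.com": 3}
clicks_q2 = {"www.marutidzire.com": 8, "www.cardekho.com": 3}
print(round(sim_keyword(q1, q2), 3))                         # 0.667  (2/3)
print(round(sim_click(clicks_q1, clicks_q2), 3))             # 0.379  (11/29)
print(round(sim_combined(q1, q2, clicks_q1, clicks_q2), 3))  # 0.523 (≈ 0.5228 in the text)
```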
4.2 Pattern Generation
For each of the clusters C1 and C2, the corresponding URLs are extracted from the Query Cluster Database and the sequential patterns are generated. Only two columns (ClusterId and URL) need to be retrieved. To simplify the calculations, the URLs are assigned to different variables, for example A = www.marutiswift.com, B = www.gaadi.com, C = www.marutidzire.com, and so on. From the sample fragment, one pattern D→BE→A is found corresponding to C1.

4.3 Weight Calculation and Rank Updation
From the above result, the weight of each URL in the discovered pattern can be determined:

Weight(D) = 1.099, Weight(B) = 0.549
Weight(E) = 0.549, Weight(A) = 0.366

Table 4 shows the optimized rank values of the pages. The improved page rank is used only for result presentation and does not affect the page rank stored in the search engine's repository.

TABLE 4
RANK MODIFICATION RELATIVE TO PAGE WEIGHTS

Page/URL | Previous Rank | Weight | New Rank
D | 5 | 1.099 | 6.099
B | 4 | 0.549 | 4.549
E | 6 | 0.549 | 6.549
A | 4 | 0.366 | 4.366
... | ... | ... | ...

The optimization carried out on a set of 30 search results corresponding to the user query "Maruti Swift" is shown in Fig. 8, which depicts both the old and the new ranks.

[Fig. 8. Rank Improvement of Pages for Query "Maruti Swift" — old and new rank values (Page Rank, 0–12) plotted against the page order (1–30) of the result list.]

It may be observed that some pages retain the same rank as before, while the pages which are most frequently accessed by users exhibit a change in their rank values. It can be concluded from these results that the ranking of many web pages may be modified. Thus, more relevant Web pages can be presented at the top of the result list according to the above implementation.

5 CONCLUSION AND FUTURE SCOPE

A novel approach based on query log analysis has been proposed for implementing effective web search. Its most important feature is that the result optimization method is based on users' feedback, which determines the relevance between Web pages and user query words. Since the result improvement is based on the analysis of query logs, the recommendations and the returned pages are mapped to the user feedback and show higher relevance than pages which exist in the result list but are never accessed by the users. In this way, the time a user spends seeking out the required information from the search result list can be reduced and more relevant Web pages can be presented.

The results obtained from the practical evaluation are quite promising for improving the effectiveness of interactive web search engines. Further studies may result in more advanced mining mechanisms which can provide more comprehensive information about the relevancy of the query terms and allow identifying users' information needs more effectively.
REFERENCES

[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan, "Searching the Web," ACM Transactions on Internet Technology, Vol. 1, No. 1, pp. 97-101, 2001.
[2] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg, "Mining the Web's link structure," Computer, 32(8):60–67, 1999.
[3] Neelam Duhan, A. K. Sharma, and Komal Kumar Bhatia, "Page Ranking Algorithms: A Survey," In Proceedings of the IEEE International Advanced Computing Conference (IACC'09), Patiala, India, 6-7 March 2009, pp. 1530-1537.
[4] A. Borchers, J. Herlocker, J. Konstan, and J. Riedl, "Ganging up on information overload," Computer, Vol. 31, No. 4, pp. 106-108, 1998.
[5] Edgar Meij, Marc Bron, Bouke Huurnink, Laura Hollink, and Maarten de Rijke, "Learning semantic query suggestions," In 8th International Semantic Web Conference (ISWC 2009), Springer, October 2009.
[6] Doug Beeferman and Adam Berger, "Agglomerative clustering of a search engine query log," In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, pp. 407–416, 2000.
[7] J. Wen, J. Nie, and H. Zhang, "Clustering user queries of a search engine," In Proceedings of the 10th International World Wide Web Conference, W3C, 2001.
[8] K. Hofmann, M. de Rijke, B. Huurnink, and E. Meij, "A Semantic Perspective on Query Log Analysis," In Working Notes for the CLEF 2009 Workshop, Corfu, Greece.
[9] H. Ma, H. Yang, I. King, and M. R. Lyu, "Learning latent semantic relations from clickthrough data for query suggestion," In CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 709–718, New York, NY, USA, 2008, ACM.
[10] Bernard J. Jansen and Udo Pooch, "A review of web searching studies and a framework for future research," J. Am. Soc. Inf. Sci. Technol., 52(3):235–246, 2001.
[11] Thorsten Joachims, "Optimizing search engines using clickthrough data," In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142, New York.
[12] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, pp. 107–117, 1998.
[13] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Technical Report, Stanford Digital Libraries SIDL-WP-1999-0120, 1999.
[14] K. Bharat and G. A. Mihaila, "When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics," ACM Transactions on Information Systems, Vol. 20, No. 1, pp. 47-58, 2002.
[15] D. Zhang and Y. Dong, "An Efficient Algorithm to Rank Web Resources," In Proceedings of the 9th International World Wide Web Conference, pp. 449-455, 2000.
[16] B. Amento, L. Terveen, and W. Hill, "Does Authority Mean Quality? Predicting Expert Quality Ratings of Web Documents," In Proceedings of the 23rd International ACM SIGIR Conference, pp. 296-303, 2000.
[17] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and performance improvements," In Proceedings of the 5th International Conference on Extending Database Technology (EDBT), France, March 1996.
[18] R. Agrawal and R. Srikant, "Mining sequential patterns," In Proceedings of the 11th International Conference on Data Engineering (ICDE'95), pp. 3–14.
[19] Murat Ali Bayir, Ismail H. Toroslu, and Ahmet Cosar, "Performance Comparison of Pattern Discovery Methods on Web Log Data," In Proceedings of AICCSA, 2006, pp. 445-451.
[20] A. K. Sharma, Neelam Duhan, Neha Aggarwal, and Ranjana Gupta, "Web Search Result Optimization by Mining the Search Engine Logs," In Proceedings of the International Conference on Methods and Models in Computer Science (ICM2CS-2010), JNU, Delhi, India, Dec. 13-14, 2010.

Neelam Duhan received her B.Tech. degree in Computer Science & Engineering with Hons. from Kurukshetra University, Kurukshetra in 2002 and her M.Tech. degree with Hons. in Computer Engineering from Maharshi Dayanand University, Rohtak in 2005. Presently, she is working as Assistant Professor in the Computer Engineering Department of YMCA University of Science & Technology, Faridabad. She is pursuing a Ph.D. in Computer Engineering from Maharshi Dayanand University, Rohtak and has a teaching experience of 7 years. Her areas of interest are Databases, Data Mining, Search Engines and Web Mining.

Prof. A. K. Sharma received his M.Tech. in Computer Science & Technology with Hons. from the University of Roorkee (presently I.I.T. Roorkee) in 1989 and a Ph.D. (Fuzzy Expert Systems) from JMI, New Delhi in the year 2000. He obtained his second Ph.D. in IT from IIITM, Gwalior in 2004. Presently he is working as the Dean, Faculty of Engineering and Technology & Chairman, Dept. of Computer Engineering at YMCA University of Science and Technology, Faridabad. His research interests include Fuzzy Systems, Object Oriented Programming, Knowledge Representation and Internet Technologies. He has guided 9 Ph.D. theses and 8 more are in progress, with about 175 research publications in international and national journals and conferences. He is the author of 7 books. Besides being a member of many BOS and Academic Councils, he has been Visiting Professor at JMI, IIITM, and I.I.T. Roorkee.