JOURNAL OF COMPUTING, VOLUME 2, ISSUE 12, DECEMBER 2010, ISSN 2151-9617
HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG
Abstract—Query logs are important information repositories, which record user activities on the search results. Mining these logs can improve the performance of search engines. Search engines generally return long lists of ranked pages, and finding the desired information content within them is difficult for the user; therefore, search result optimization techniques come into play. The proposed system, based on learning from query logs, predicts user information needs and reduces the seek time of the user within the search result list. To achieve this, the method first mines the logs using a novel similarity function to perform query clustering, and then discovers the sequential order of clicked URLs in each cluster using a modified version of an existing sequential pattern mining algorithm. Finally, the search result list is optimized by re-ranking the pages using the discovered sequential patterns. The proposed system proves to be efficient, as the relevant pages desired by the user occupy earlier places in the result list, thus reducing the search space. The paper also presents a query recommendation scheme towards better information retrieval.
Index Terms—World Wide Web, Information Retrieval, Query Log, Web Usage Mining, Ranking Algorithm.
1 INTRODUCTION
…recommending him with popular historical queries. The paper has been organized as follows: Section 2 describes the basic terminologies forming the basis of the proposed work and the literature work done under them. Section 3 explains the proposed optimization system along with examples. The experimental evaluation of the work is given in Section 4, and Section 5 concludes the paper with some discussion on future research.

2 PRELIMINARIES AND RELATED WORK

2.1 Query Logs
The notion of Web Usage/Log Mining has been a subject of interest for many years. In the context of search engines, servers record an entry in the log for every single access they get corresponding to a query. Thus, the log mainly contains users' queries and the corresponding clicked URLs, as well as other information about their browsing activities. The typical logs [5] of search engines include the following entries: 1) User (session) ID, 2) Query q issued by the user, 3) URL u accessed/clicked by the user, 4) Rank r of the URL u clicked for the query q, and 5) Time t at which the query has been submitted. An example illustration of the query log is shown in Table 1.

TABLE 1
A SAMPLE CLICKTHROUGH DATA FOR THE QUERY LOG

UserID | Query       | Clicked URL                                    | r | Time
27098  | Data mining | datamining/typepad.com                         | 6 | 2009-03-01 00:01:10
27098  | Data mining | dataoutsourcingindia.com                       | 5 | 2009-03-01 00:01:10
48785  | Database    | webopedia.com/database.html                    | 5 | 2009-03-01 00:01:16
48785  | Database    | en.wikipedia.org/wiki/database                 | 7 | 2009-03-01 00:01:16
48785  | Database    | http://www.cmie.com                            | 6 | 2009-03-01 00:01:16
25112  | Web mining  | searchcrm.techtarget.com/definition/Web-mining | 5 | 2009-03-01 00:01:40
25112  | Web mining  | cse.iitb.ac.in/~soumen/mining-the-web/         | 4 | 2009-03-01 00:01:40
…      | …           | …                                              | … | …

2.2 Page Ranking
…the requirement of a user could not be matched [16]. In this manner, the web pages most relevant to the user's query words will not be shown at the top of the search result list. Google uses the PageRank algorithm to calculate the rank score and combines it with the text matching score of the page at query time to return the important and relevant pages. The PageRank formula can be defined as:

    PR(u) = (1 − d) + d · Σ_{v ∈ B(u)} PR(v) / N_v        (1)

where u represents a web page, B(u) is the set of pages that point to u, PR(u) and PR(v) are the rank scores of pages u and v respectively, N_v denotes the number of outgoing links of page v, and d is called the damping factor and is usually set to 0.85.

2.3 Sequential Pattern Detection
The problem of mining sequential patterns of web pages has been an active area of research, as it helps in finding the order in which the web pages get visited by the users. AprioriAll [17, 18] finds all patterns, but it is a three-phase algorithm: it first finds all itemsets with minimum support, then transforms the database so that each transaction is replaced by the set of all frequent itemsets contained in it, and finally finds the sequential patterns. Empirical evaluation by the researchers indicates that GSP [17, 19] is much faster than the AprioriAll algorithm.

A critical look at the available literature indicates that modern-day search engines apply different optimization measures [20] to their search results in one way or another so as to better satisfy user needs, but the user is still presented with either irrelevant search results or the problem of finding the required information within the search results. The proposed method gives prime importance to the information needs of the user and optimizes the rank values of returned web pages according to the history of pages accessed by them.
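The PageRank computation in (1) can be illustrated as a simple fixed-point iteration. The following sketch is a minimal illustration only; the three-page link graph and the iteration count are invented for the example and do not come from the paper:

```python
# Minimal sketch of formula (1): PR(u) = (1 - d) + d * sum over v in B(u) of PR(v)/N_v,
# where B(u) is the set of pages linking to u and N_v is the out-degree of v.
# The toy link graph below is a made-up example.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                     # initial rank scores
    out_degree = {p: len(links[p]) for p in pages}   # N_v for each page v
    for _ in range(iterations):
        new_pr = {}
        for u in pages:
            backlinks = [v for v in pages if u in links[v]]   # B(u)
            new_pr[u] = (1 - d) + d * sum(pr[v] / out_degree[v] for v in backlinks)
        pr = new_pr
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
# C is linked by both A and B and therefore ends up with the highest rank score.
```

Note that this is the non-normalized variant written in (1): with no dangling pages, the rank scores sum to the number of pages rather than to 1.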
Step 4: The final approach is to optimize the user's web search based on the outputs of the previous two steps.

Optimization in Step 4 is carried out by re-ranking the search result list, i.e., by modifying the already assigned rank scores of the web pages using the discovered sequential patterns. The rank updation improves the relevancy of a web page based on its access history. Step 2 is used to recommend to the user the most famous query, along with many similar queries, for a better search.

3.1 Proposed Architecture
The architecture of the proposed system is shown in Fig. 1, which consists of the following functional components:
1. Similarity Analyzer
2. Query Clustering Tool
3. Favored Query Finder
4. Sequential Pattern Generator
5. Rank Updater
6. Query Recommender

Fig. 1. Architecture of Proposed Optimization System. (The figure connects the Interface of Search Engine, Query Processor, Index and Query Log to the Similarity Analyzer, Query Clustering Tool, Favored Query and Pattern Generator, Favored Query Finder, Query Cluster Database, Rank Updater and Query Recommender.)

When a user submits a query on the search engine interface, the query processor component matches the query terms with the index repository of the search engine and returns a list of matched documents in response. On the back end, user browsing behavior, including the submitted queries and clicked URLs, gets stored in the logs and is analyzed continuously by the Similarity Analyzer module, whose output is forwarded to the Query Clustering Tool to generate groups of queries based on their similarities. The Favored Query Finder extracts the most popular queries from each cluster and stores them for future reference. The Pattern Generator module discovers sequential patterns of web pages in each cluster. The Rank Updater component works online and takes as input the matched documents retrieved by the query processor. It improves the ranks of pages according to the sequential patterns which were discovered offline. The Query Recommender guides the user with similar queries, with the most famous query highlighted.

The working and algorithms of the different functional modules are explained in the next subsections.

3.2 Query Similarity Analyzer
This approach is based on two principles: similarity based on the query keywords and similarity based on cross-references. These principles are formulated below.

3.2.1 Similarity Based on Query Keywords
If two user queries contain the same or similar terms, they denote the same or similar information needs. The following formula is used to measure the content similarity between two queries:

    Sim_keyword(p, q) = |KW(p, q)| / max(|kw(p)|, |kw(q)|)        (2)

where kw(p) and kw(q) are the sets of keywords in the queries p and q respectively, and KW(p, q) is the set of common keywords of the two queries.

3.2.2 Similarity Based on User Feedback
Two queries are considered to be similar if they share, or result in the selection of, the same or similar documents. This principle is based on Beeferman and Berger's agglomerative clustering algorithm [6]. The approach is content-ignorant, which means that the algorithm does not make use of the actual content of the queries and the documents in clustering. The Similarity Analyzer first constructs a bipartite graph with one set of vertices corresponding to queries and the other corresponding to documents, as shown in Fig. 2.

Fig. 2. A Bipartite Graph of the Query Log. (The figure shows query vertices Q1–Q4 connected to document vertices D1–D7, with edge labels such as 10, 100 and 1000 giving click counts.)

A query vertex is joined with a document vertex if the document has been accessed/clicked by a user corresponding to the said query. The integer on each edge dictates the number of accesses to the document by distinct users for a particular query. For example, a value 10 between Q1 and D1 says that 10 users have clicked on D1 corresponding to Q1. In the figure, D1, D2 and D4 are accessed with respect to Q1 and thus are relevant to Q1, while D2, D3 and D4 are relevant to Q2, and so on. As Q1 and Q2 share the two documents D2 and D4, they can be considered similar, but the similarity is decided on the basis of the number of document clicks.

If two queries p and q share a common document d, then the similarity value is the ratio of the total number of distinct clicks on d with respect to both queries to the total number of distinct clicks on all the documents accessed for both queries. If more than one document is shared, the numerator is obtained by summing up the document clicks of all common documents. The following formula dictates the similarity function based on document clicks:

    Sim_clickURL(p, q) = Σ_{di ∈ CD(p) ∩ CD(q)} [LC(p, di) + LC(q, di)] / Σ_{xi ∈ CD(p) ∪ CD(q)} [LC(p, xi) + LC(q, xi)]        (3)

where LC(p, d) and LC(q, d) are the numbers of clicks on document d corresponding to queries p and q respectively, and CD(p) and CD(q) are the sets of clicked documents corresponding to queries p and q respectively.

As an example illustration, Q1 and Q2 share the two common documents D2 and D4, while D1, D2, D3 and D4 are accessed by either Q1 or Q2 or both. The similarity between the two queries is:

    Sim_clickURL(Q1, Q2) = [(100 + 10) + (1000 + 100)] / [(10 + 0) + (100 + 10) + (0 + 10) + (1000 + 100)] = 0.984

Similarly, the similarity between Q1 and Q3 is:

    Sim_clickURL(Q1, Q3) = 1010 / 2220 = 0.455

The similarity values always lie between 0 and 1. The measure given in (3) declares two queries similar by imposing a threshold on their similarity value.

…multiple clusters. Each returned cluster is stored in the Query Cluster Database along with the associated queries, query keywords and clicked URLs. The clustering algorithm takes O(n²) worst-case time to find all the query clusters, where n is the total number of queries.

Algorithm: Query_Clustering(Q, α, β, threshold)
Given: A set of n queries and corresponding clicked URLs stored in an array Q[qi, URL1, ..., URLm], 1 ≤ i ≤ n
       α = β = 0.5
       threshold = similarity threshold
Output: A set C = {C1, C2, ..., Ck} of k query clusters

// Start of Algorithm
k = 1;                        // k is the number of clusters
For (each query p in Q)
    Set ClusterId(p) = Null;  // initially no query is clustered
For (each p ∈ Q)
{
    ClusterId(p) = Ck;
    Ck = {p};
    For (each q ∈ Q such that p ≠ q)
    {
        Simkeyword(p, q) = |KW(p, q)| / max(|kw(p)|, |kw(q)|)
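The two similarity measures (2) and (3), and their weighted combination Sim_combined = α·Sim_keyword + β·Sim_click with α = β = 0.5 (the combination is inferred from the worked Maruti Swift example, Sim_combined = (0.5)(2/3) + (0.5)(11/29)), can be sketched as follows. Function and variable names are illustrative only, not taken from the paper's implementation:

```python
# Sketch of equations (2) and (3) and their weighted combination.
# The max() denominator in (2) is inferred from the worked value 2/3.

def sim_keyword(p_kw, q_kw):
    """Equation (2): |KW(p, q)| over the size of the larger keyword set."""
    common = p_kw & q_kw                           # KW(p, q)
    return len(common) / max(len(p_kw), len(q_kw))

def sim_click(p_clicks, q_clicks):
    """Equation (3): clicks on shared documents over clicks on all documents.
    p_clicks / q_clicks map each clicked document to its click count LC(., d)."""
    shared = p_clicks.keys() & q_clicks.keys()     # CD(p) ∩ CD(q)
    all_docs = p_clicks.keys() | q_clicks.keys()   # CD(p) ∪ CD(q)
    num = sum(p_clicks[d] + q_clicks[d] for d in shared)
    den = sum(p_clicks.get(d, 0) + q_clicks.get(d, 0) for d in all_docs)
    return num / den if den else 0.0

def sim_combined(p_kw, q_kw, p_clicks, q_clicks, alpha=0.5, beta=0.5):
    return alpha * sim_keyword(p_kw, q_kw) + beta * sim_click(p_clicks, q_clicks)

# Fig. 2 example: Q1 clicks D1, D2, D4; Q2 clicks D2, D3, D4.
q1 = {"D1": 10, "D2": 100, "D4": 1000}
q2 = {"D2": 10, "D3": 10, "D4": 100}
print(round(sim_click(q1, q2), 3))   # → 0.984, matching the Q1/Q2 worked example

# Keyword sets of the paper's Maruti Swift example queries.
kw1 = {"maruti", "swift", "price"}
kw2 = {"maruti", "swift", "dzire"}
print(round(sim_keyword(kw1, kw2), 3))  # → 0.667, i.e. the paper's 2/3
```

Two queries would then be placed in the same cluster whenever sim_combined exceeds the chosen similarity threshold, as in the Query_Clustering algorithm above.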
…understand the algorithm. The figure shows the final iteration of the candidate generation and pruning phase.

Fig. 7 (reconstructed outline): from the frequent 3-sequences <{A}{B}{C}>, <{A}{BE}>, <{A}{E}{C}>, <{B}{C}{D}>, <{BE}{C}>, <{C}{D}{E}> and <{E}{CD}>, candidate generation yields <{A}{B}{C}{D}>, <{A}{BE}{C}>, <{A}{E}{CD}>, <{B}{C}{D}{E}> and <{BE}{CD}>, and candidate pruning retains <{A}{BE}{C}>, shown as a pattern tree rooted at A with children B and E.

where lenpat(X) is the effective length/depth of the sequential pattern in which X occurs and level(X) is the depth of X in the pattern.

Considering the example pattern of Fig. 7, the weights of the pages come out to be:
    Weight(A) = ln(3)/1 = 1.099
    Weight(B) = 0.549, etc.
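The weight values reported above are consistent with Weight(X) = ln(lenpat(X)) / level(X); assuming that is the intended formula (its definition precedes this excerpt), a minimal sketch for the pattern <{A}{BE}{C}> of effective length 3, with A at depth 1 and B, E at depth 2:

```python
import math

def page_weight(lenpat, level):
    """Weight(X) = ln(lenpat(X)) / level(X), where lenpat is the effective
    length/depth of the sequential pattern containing page X and level is
    the depth of X within that pattern. Assumed from the reported values."""
    return math.log(lenpat) / level

# For the pattern <{A}{BE}{C}> of effective length 3:
print(round(page_weight(3, 1), 3))  # Weight(A) → 1.099
print(round(page_weight(3, 2), 3))  # Weight(B) → 0.549
```

Pages deeper in a pattern thus receive smaller weights, so earlier pages of a frequently followed sequence contribute more to the rank update.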
As an example, consider q1 = "Maruti Swift Price" and q2 = "Maruti Swift Dzire":
    Sim_keyword(q1, q2) = 2/3 and Sim_click(q1, q2) = 11/29 = 0.379
    Sim_combined(q1, q2) = (0.5)(2/3) + (0.5)(11/29) = 0.5228
Since the value of the similarity threshold is 0.5, both are grouped in the same cluster C1 and stored in the Query Cluster Database with the keywords {maruti, swift, dzire, price} and the set of URLs {www.marutidzire.com, marutiswift.com, gaadi.com,

Fig. 8. Rank Improvement of Pages for Query "Maruti Swift". (The figure plots old rank against new rank for each page.)

It may be observed that some pages retain the same rank as before, while the pages which are most frequently accessed by users exhibit a change in their rank values. It can be evaluated from these results that the ranking of
many web pages may be modified. Thus, more relevant Web pages can be presented at the top of the result list according to the above implementation.

5 CONCLUSION AND FUTURE SCOPE
A novel approach based on query log analysis has been proposed for implementing effective web search. Its most important feature is that the result optimization method is based on users' feedback, which determines the relevance between Web pages and user query words. Since result improvement is based on the analysis of query logs, the recommendations and the returned pages are mapped to the user feedback and dictate higher relevance than pages which exist in the result list but are never accessed by the user. In this way, the time a user spends seeking out the required information from the search result list can be reduced, and more relevant Web pages can be presented.

The results obtained from practical evaluation are quite promising in improving the effectiveness of interactive web search engines. Further studies may result in more advanced mining mechanisms which can provide more comprehensive information about the relevancy of the query terms and allow identifying users' information needs more effectively.

REFERENCES
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan, "Searching the Web," ACM Transactions on Internet Technology, Vol. 1, No. 1, pp. 97-101, 2001.
[2] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg, "Mining the Web's link structure," Computer, Vol. 32, No. 8, pp. 60-67, 1999.
[3] Neelam Duhan, A. K. Sharma, and Komal Kumar Bhatia, "Page Ranking Algorithms: A Survey," In Proceedings of the IEEE International Advanced Computing Conference (IACC'09), Patiala, India, 6-7 March 2009, pp. 1530-1537.
[4] A. Borchers, J. Herlocker, J. Konstan, and J. Riedl, "Ganging up on information overload," Computer, Vol. 31, No. 4, pp. 106-108, 1998.
[5] Edgar Meij, Marc Bron, Bouke Huurnink, Laura Hollink, and Maarten de Rijke, "Learning semantic query suggestions," In 8th International Semantic Web Conference (ISWC 2009), Springer, October 2009.
[6] Doug Beeferman and Adam Berger, "Agglomerative clustering of a search engine query log," In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, 2000, pp. 407-416.
[7] J. Wen, J. Nie, and H. Zhang, "Clustering user queries of a search engine," In Proceedings of the 10th International World Wide Web Conference, W3C, 2001.
[8] K. Hofmann, M. de Rijke, B. Huurnink, and E. Meij, "A Semantic Perspective on Query Log Analysis," In Working Notes for the CLEF 2009 Workshop, Corfu, Greece.
[9] H. Ma, H. Yang, I. King, and M. R. Lyu, "Learning latent semantic relations from clickthrough data for query suggestion," In CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, New York, NY, USA, 2008, pp. 709-718.
[10] Bernard J. Jansen and Udo Pooch, "A review of web searching studies and a framework for future research," Journal of the American Society for Information Science and Technology, Vol. 52, No. 3, pp. 235-246, 2001.
[11] Thorsten Joachims, "Optimizing search engines using clickthrough data," In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, pp. 133-142.
[12] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, pp. 107-117, 1998.
[13] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Technical Report SIDL-WP-1999-0120, Stanford Digital Libraries, 1999.
[14] K. Bharat and G. A. Mihaila, "When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics," ACM Transactions on Information Systems, Vol. 20, No. 1, pp. 47-58, 2002.
[15] D. Zhang and Y. Dong, "An Efficient Algorithm to Rank Web Resources," In Proceedings of the 9th International World Wide Web Conference, pp. 449-455, 2000.
[16] B. Amento, L. Terveen, and W. Hill, "Does Authority Mean Quality? Predicting Expert Quality Ratings of Web Documents," In Proceedings of the 23rd International ACM SIGIR Conference, pp. 296-303, 2000.
[17] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and performance improvements," In Proceedings of the 5th International Conference on Extending Database Technology (EDBT), France, March 1996.
[18] R. Agrawal and R. Srikant, "Mining sequential patterns," In Proceedings of the 11th International Conference on Data Engineering (ICDE'95), pp. 3-14.
[19] Murat Ali Bayir, Ismail H. Toroslu, and Ahmet Cosar, "Performance Comparison of Pattern Discovery Methods on Web Log Data," In Proceedings of AICCSA, 2006, pp. 445-451.
[20] A. K. Sharma, Neelam Duhan, Neha Aggarwal, and Ranjana Gupta, "Web Search Result Optimization by Mining the Search Engine Logs," In Proceedings of the International Conference on Methods and Models in Computer Science (ICM2CS-2010), JNU, Delhi, India, Dec. 13-14, 2010.

Neelam Duhan received her B.Tech. degree in Computer Science & Engineering with Hons. from Kurukshetra University, Kurukshetra in 2002 and her M.Tech. degree with Hons. in Computer Engineering from Maharshi Dayanand University, Rohtak in 2005. Presently, she is working as Assistant Professor in the Computer Engineering Department at YMCA University of Science & Technology, Faridabad. She is pursuing a Ph.D. in Computer Engineering from Maharshi Dayanand University, Rohtak, and has a teaching experience of 7 years. Her areas of interest are Databases, Data Mining, Search Engines and Web Mining.

Prof. A. K. Sharma received his M.Tech. in Computer Science & Technology with Hons. from the University of Roorkee (presently I.I.T. Roorkee) in 1989 and a Ph.D. (Fuzzy Expert Systems) from JMI, New Delhi in the year 2000. He obtained his second Ph.D. in IT from IIITM, Gwalior in 2004. Presently he is working as the Dean, Faculty of Engineering and Technology, and Chairman, Department of Computer Engineering at YMCA University of Science and Technology, Faridabad. His research interests include Fuzzy Systems, Object Oriented Programming, Knowledge Representation and Internet Technologies. He has guided 9 Ph.D. theses and 8 more are in progress, with about 175 research publications in International and National journals and conferences. He is the author of 7 books. Besides being a member of many BOS and Academic Councils, he has been a Visiting Professor at JMI, IIITM, and I.I.T. Roorkee.