
CANDIDATE'S DECLARATION

I hereby certify that the work presented in this thesis, titled "Downloading the hidden web content for multi-attribute query interface using web log techniques", in partial fulfillment of the requirements for the degree of Master of Technology in Information Technology and submitted to YMCA University of Science and Technology, Faridabad, is an authentic record of my own work carried out under the supervision of Dr. A. K. Sharma and Dr. Komal Kr. Bhatia. The matter presented in this thesis has not been submitted by me for the award of any other degree of this institute or of any other university.

(NISHA CHHABRA)


CERTIFICATE
This is to certify that the thesis titled "Downloading the hidden web content for multi-attribute query interface using web log techniques", which is being submitted by NISHA CHHABRA to YMCA University of Science and Technology, Faridabad, for the award of the degree of Master of Technology in Information Technology, is a record of bona fide work carried out by her under my supervision. In my opinion, the thesis meets the standards required by the regulations for the degree. The work contained in this thesis has not been submitted to any other university or institute for the award of any other degree or diploma.

Komal Kr. Bhatia (Co-Supervisor), Department of Computer Engineering, YMCA University of Science and Technology, Faridabad.

A. K. Sharma (Supervisor), Department of Computer Engineering, YMCA University of Science and Technology, Faridabad.

A. K. Sharma, Chairman, Computer Engg. Deptt., YMCA University of Science and Technology, Faridabad.


ACKNOWLEDGEMENT
It is with a deep sense of gratitude and reverence that I express my sincere thanks to my co-supervisor, Dr. Komal Kr. Bhatia, for her guidance, encouragement, help and valuable suggestions throughout this work. Her untiring and painstaking efforts, methodical approach and individual help made it possible for me to complete this work in time. I consider myself very fortunate to have been associated with a scholar like her. Her affection, kindness and scientific approach served as a veritable incentive for the completion of this work. I am grateful to Dr. A. K. Sharma, Chairman of the Computer Engineering Department, YMCA University of Science and Technology, Faridabad, for the guidance, encouragement, support and valuable suggestions he provided. I shall ever remain indebted to the faculty members of YMCA University of Science and Technology, Faridabad, and Apeejay Stya University for their cooperation, kindness and general help extended to me during the completion of this thesis. My parents, Mr. Ramesh Chhabra and Mrs. Savitri Chhabra, and my brother, Mr. Chirag Chhabra, deserve special mention for their incomparable support and prayers. Their support and guidance whenever I faced difficulties helped me get through exhausting days. I am grateful to everyone mentioned here and to everyone else who has provided help and support in one way or another.

(Nisha Chhabra)


ABSTRACT
Traditional crawlers follow links on the Web to discover and download pages. They therefore cannot reach Hidden Web pages, which are accessible only through query interfaces. A large part of the Web is hidden behind search forms and is reachable only when users type a set of keywords, or queries, into those forms. These pages are often referred to as the Hidden Web or the Deep Web: because search engines typically cannot index them and do not return them in their results, the pages are essentially hidden from a typical Web user.

A variety of Hidden Web sources provide information on a multitude of topics. Depending on the type of information, a Hidden Web site may be categorized as either a textual database or a structured database. We studied how to build a Hidden Web crawler that can automatically query the textual database of a Hidden Web site and download pages from it. Since the only entry point to a Hidden Web site is its query interface, the main challenge for a Hidden Web crawler is how to automatically generate meaningful queries to issue to the site. We intend to do the same for structured, or multi-attribute, databases. For multi-attribute queries, the site often returns pages that contain values for each of the query attributes. For example, when an online bookstore supports queries on title, author and ISBN, the pages returned for a query typically contain the title, author and ISBN of the corresponding books.

Query logs are important information repositories that record user activity on the search results. A log typically includes the following entries: user id, the query issued by the user, the URL accessed or clicked by the user, the rank of the URL, and the time at which the query was submitted. Whenever a user issues a query, it is matched against the logs maintained for that particular site to check whether such a query has been issued earlier. If it has, the similarity between the new query and the earlier queries is calculated, and queries whose similarity value is above a threshold are clustered together. We then find the popular queries of each cluster. For each query, we calculate a weight as the ratio of the number of IP addresses that have fired that query to the total number of IP addresses in its cluster. If the weight is above a threshold, the query is said to be favored.
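The clustering and weighting steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation developed in the thesis: the abstract does not fix a similarity measure at this point, so the sketch assumes a Jaccard overlap of query keywords, a greedy single-pass clustering against the first query of each cluster, and a log reduced to hypothetical `(ip, query)` pairs. The function names and threshold values are likewise illustrative.

```python
from collections import defaultdict


def keyword_similarity(q1, q2):
    """Jaccard overlap of the keyword sets of two queries (assumed measure)."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_queries(log, sim_threshold):
    """Greedy clustering (an assumption): a query joins the first cluster whose
    representative query has similarity above sim_threshold, else starts a new one."""
    clusters = []  # each cluster: {"rep": query, "entries": [(ip, query), ...]}
    for ip, query in log:
        for cluster in clusters:
            if keyword_similarity(query, cluster["rep"]) > sim_threshold:
                cluster["entries"].append((ip, query))
                break
        else:
            clusters.append({"rep": query, "entries": [(ip, query)]})
    return clusters


def favored_queries(cluster, weight_threshold):
    """Weight of a query = distinct IPs that fired it / distinct IPs in the cluster.
    Returns the queries whose weight is above weight_threshold."""
    ips_per_query = defaultdict(set)
    all_ips = set()
    for ip, query in cluster["entries"]:
        ips_per_query[query].add(ip)
        all_ips.add(ip)
    weights = {q: len(ips) / len(all_ips) for q, ips in ips_per_query.items()}
    return {q: w for q, w in weights.items() if w > weight_threshold}


if __name__ == "__main__":
    # Hypothetical log entries; the IP address stands in for the user id.
    log = [
        ("10.0.0.1", "java programming book"),
        ("10.0.0.2", "java programming book"),
        ("10.0.0.3", "java programming guide"),
        ("10.0.0.1", "cheap flight delhi"),
    ]
    for cluster in cluster_queries(log, sim_threshold=0.4):
        print(cluster["rep"], favored_queries(cluster, weight_threshold=0.5))
```

In the sample log, the three "java programming" queries fall into one cluster, and only the query fired from two of that cluster's three IP addresses exceeds the weight threshold of 0.5.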


TABLE OF CONTENTS
CANDIDATE DECLARATION .... ii
CERTIFICATE .... iii
ACKNOWLEDGEMENT .... iv
ABSTRACT .... v
TABLE OF CONTENTS .... vi
LIST OF FIGURES .... viii
LIST OF TABLES .... x

1. INTRODUCTION .... 1
  1.1 GENERAL .... 1
  1.2 SEARCH ENGINE .... 4
  1.3 HIDDEN WEB .... 5
  1.4 PROBLEM IDENTIFICATION .... 9
  1.5 ORGANISATION OF THESIS .... 9

2. SEARCH ENGINE AND THEIR STRENGTHS AND WEAKNESSES .... 11
  2.1 INTERNET .... 11
  2.2 SEARCH ENGINE .... 12
    2.2.1 TYPES OF DATA TO BE SEARCHED BY SEARCH ENGINE .... 14
  2.3 TYPES OF SEARCH ENGINE .... 15
  2.4 GENERAL ARCHITECTURE OF SEARCH ENGINE .... 17
  2.5 WEB CRAWLERS AND TYPES OF CRAWLERS .... 20
  2.6 EXTRACTION OF INFORMATION FROM HIDDEN WEB .... 27
  2.7 THE HIDDEN WEB CRAWLING .... 28
    2.7.1 TASK SPECIFICITY .... 28
    2.7.2 HUMAN ASSISTANCE .... 29
  2.8 ARCHITECTURE OF HIDDEN WEB CRAWLER .... 30
    2.8.1 INTERNAL FORM REPRESENTATION .... 32
    2.8.2 TASK SPECIFIC DATABASE .... 32
    2.8.3 MATCHING FUNCTION .... 32
    2.8.4 LABEL MATCHING .... 33
    2.8.5 SUBMISSION EFFICIENCY .... 33
  2.9 SEARCH ENGINES AND THEIR STRENGTH AND WEAKNESSES .... 34

3. DOWNLOADING THE HIDDEN WEB CONTENT: A REVIEW .... 50
  3.1 INTRODUCTION .... 50
  3.2 KEYWORD SELECTION .... 52
  3.3 ESTIMATING THE NUMBER OF MATCHING PAGES .... 53
  3.4 QUERY LOGS .... 56
  3.5 Similarity Based on User Feedback .... 57
  3.6 Similarity Based on Query Keywords .... 58
  3.7 Favored Query Finder .... 59

4. DOWNLOADING THE HIDDEN WEB CONTENT FOR MULTI-ATTRIBUTE QUERY INTERFACE USING WEB LOG TECHNIQUES .... 61
  4.1 Introduction .... 61
  4.2 Downloading the hidden web content for multi-attribute query interface using web log techniques .... 62
  4.3 ILLUSTRATIVE EXAMPLE .... 65

5. CONCLUSION AND FUTURE SCOPE .... 68
  5.1 Conclusion .... 68
  5.2 Future scope .... 68

REFERENCES .... 69


LIST OF FIGURES
Figure 2.1 User form interaction .... 17
Figure 2.2 General architecture of search engine .... 18
Figure 2.3 High level architecture of standard web .... 21
Figure 2.4 General architecture of parallel crawler .... 23
Figure 2.5 General architecture of focused crawler .... 24
Figure 2.6 Architecture of form focused crawler .... 25
Figure 2.7 Crawler form interaction .... 28
Figure 2.8 HiWE architecture .... 30
Figure 2.9 Google architecture .... 34
Figure 2.10 Repository .... 35
Figure 2.11 Forward and reverse indexes and the lexicon .... 38
Figure 2.12 The search interface of Google .... 44
Figure 2.13 Alta vista architecture .... 45
Figure 2.14 Search interface of Alta vista .... 46
Figure 2.15 Harvest architecture .... 48
Figure 3.1 Zip curve .... 55
Figure 3.2 Bipartite graph of query log .... 57
Figure 4.1 Bipartite graph .... 62


LIST OF TABLES
Table 3.1 Probability calculation .... 54
Table 3.2 Query log table .... 56
Table 3.3 Query log example .... 57
Table 4.1 Illustrative example of Log table .... 64