(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 6, June 2011
Modeling and Analyzing the Deep Web: Surfacing Hidden Value
SUNEET KUMAR
Associate Professor, Computer Science Dept., Dehradun Institute of Technology, Dehradun, India; suneetcit81@gmail.com
RAKESH BHARATI
Assistant Professor, Computer Science Dept., Dehradun Institute of Technology, Dehradun, India; goswami.rakesh@gmail.com
ANUJ KUMAR YADAV
Assistant Professor, Computer Science Dept., Dehradun Institute of Technology, Dehradun, India; anujbit@gmail.com
RANI CHOUDHARY
Sr. Lecturer, Computer Science Dept., BBDIT, Ghaziabad, India; ranichoudhary04@gmail.com
Abstract
Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While the well-known focused crawlers retrieve relevant web-pages, there are various applications which target whole websites instead of single web-pages. For example, companies are represented by websites, not by individual web-pages. To answer queries targeted at websites, web directories are an established solution. In this paper, we introduce a novel focused website crawler that employs the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites. The external crawler views the web as a graph of linked websites, selects the websites to be examined next and invokes internal crawlers. Each internal crawler views the web-pages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling which were adapted to retrieve websites instead of single web-pages.
 
Keywords- Deep Web; Link references; Searchable Databases; Site page-views.
I. INTRODUCTION
A. The Deep Web
 
Internet content is considerably more diverse and the volume certainly much larger than commonly understood. First, though sometimes used synonymously, the World Wide Web (HTTP protocol) is but a subset of Internet content. Other Internet protocols besides the Web include FTP (file transfer protocol), e-mail, news, Telnet, and Gopher (most prominent among pre-Web protocols). This paper does not consider these non-Web protocols further [1]. Second, even within the strict context of the Web, most users are aware only of the content presented to them via search engines such as Excite, Google, AltaVista, or Northern Light, or search directories such as Yahoo!, About.com, or LookSmart. Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations [2]. According to a recent survey of search-engine satisfaction by market-researcher NPD, search failure rates have increased steadily since 1997 [3]. The importance of information gathering on the Web and the central and unquestioned role of search engines -- plus the frustrations expressed by users about the adequacy of these engines -- make them an obvious focus of investigation. Our key findings include:
 
- Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
- The deep Web contains 7,500 terabytes of information, compared to 19 terabytes of information in the surface Web.
- The deep Web contains nearly 550 billion individual documents, compared to the one billion of the surface Web.
- More than 200,000 deep Web sites presently exist.
- Sixty of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web forty times.
- On average, deep Web sites receive 50% greater monthly traffic than surface sites and are more highly linked to than surface sites; however, the typical (median) deep Web site is not well known to the Internet-searching public.
- The deep Web is the largest growing category of new information on the Internet.
- Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
- Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
- Deep Web content is highly relevant to every information need, market, and domain.
- More than half of the deep Web content resides in topic-specific databases.
- A full ninety-five per cent of the deep Web is publicly accessible information -- not subject to fees or subscriptions.
B. How Search Engines Work
Search engines obtain their listings in two ways: authors may submit their own Web pages, or the search engines "crawl" or "spider" documents by following one hypertext link to another. The latter returns the bulk of the listings. Crawlers work by recording every hypertext link in every page they index while crawling. Like ripples propagating across a pond, search-engine crawlers are able to extend their indices further and further from their starting points. The surface Web contains an estimated 2.5 billion documents, growing at a rate of 7.5 million documents per day [4a]. The largest search engines have done an impressive job in extending their reach, though Web growth itself has exceeded the crawling ability of search engines [5][6]. Today, the three largest search engines in terms of internally reported documents indexed are Google with 1.35 billion documents (500 million available to most searches) [7], Fast with 575 million documents [8], and Northern Light with 327 million documents [9]. Moreover, consider how a search engine obtains its listings in the first place, whether adjusted for popularity or not: without a linkage from another Web document, a page will never be discovered. The main failing of search engines is that they depend on the Web's linkages to identify what is on the Web. Figure 1 is a graphical representation of the limitations of the typical search engine. The content identified is only what appears on the surface, and the harvest is fairly indiscriminate. There is tremendous value that resides deeper than this surface content. The information is there, but it is hiding beneath the surface of the Web.
FIGURE 1. SEARCH ENGINES: DRAGGING A NET ACROSS THE WEB'S SURFACE
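The link-following crawl described above can be made concrete with a short sketch. The following Python fragment is only an illustration of the general technique, not the code of any engine named in this paper; the seed URL, the page limit, and the helper names are assumptions made for the example.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    # Collects the href target of every <a> tag encountered on a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50):
    # Breadth-first crawl: fetch a page, record its hypertext links, then
    # follow them outward like ripples from the starting point.
    seen, queue, index = {seed}, deque([seed]), {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # unreachable pages are simply skipped
        parser = LinkExtractor()
        parser.feed(html)
        index[url] = parser.links          # the recorded links for this page
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative references
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index

# Hypothetical usage: crawl("http://example.com", max_pages=10)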
II. SEARCHABLE DATABASES: HIDDEN VALUE ON THE WEB
How does information appear and get presented on the Web? In the earliest days of the Web, there were relatively few documents and sites. It was a manageable task to post all documents as static pages. Because all pages were persistent and constantly available, they could be crawled easily by conventional search engines. In July 1994, the Lycos search engine went public with a catalog of 54,000 documents [10]. Since then, the compound growth rate in Web documents has been on the order of more than 200% annually! [11] Sites that were required to manage tens to hundreds of documents could easily do so by posting fixed HTML pages within a static directory structure. However, beginning about 1996, three phenomena took place. First, database technology was introduced to the Internet through such vendors as Bluestone's Sapphire/Web (Bluestone has since been bought by HP) and later Oracle. Second, the Web became commercialized, initially via directories and search engines, but rapidly evolved to include e-commerce. And, third, Web servers were adapted to allow the "dynamic" serving of Web pages (for example, Microsoft's ASP and the Unix PHP technologies). Figure 2 represents, in a non-scientific way, the improved results that can be obtained by BrightPlanet technology. By first identifying where the proper searchable databases reside, a directed query can then be placed to each of these sources simultaneously to harvest only the results desired -- with pinpoint accuracy.
FIGURE 2. HARVESTING THE DEEP AND SURFACE WEB WITH A DIRECTED QUERY ENGINE
 
Additional aspects of this representation will be discussed throughout this study. For the moment, however, the key points are that content in the deep Web is massive -- approximately 500 times greater than that visible to conventional search engines -- with much higher quality throughout.
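The directed-query idea behind Figure 2 can be sketched briefly: the same query is placed against several searchable sources at once and the raw answers are pooled. This is only an illustration of the concept, not BrightPlanet's implementation; the endpoint URLs and the query parameter name are hypothetical.

from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical searchable-database endpoints; real deep Web sources would each
# have their own search URL and parameter conventions.
SOURCES = [
    "http://db-one.example/search",
    "http://db-two.example/search",
]

def query_source(endpoint, query):
    # Place one directed query against one searchable database.
    url = endpoint + "?" + urlencode({"q": query})
    try:
        return urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    except Exception:
        return ""  # a source that fails to answer contributes nothing

def directed_query(query):
    # Query all known sources simultaneously and pool their raw responses.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        results = pool.map(lambda source: query_source(source, query), SOURCES)
        return dict(zip(SOURCES, results))

# Hypothetical usage: answers = directed_query("deep web size estimates")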
III. STUDY OBJECTIVES
To perform the study discussed, we used our technology in an iterative process. Our goal was to:
- Quantify the size and importance of the deep Web.
- Characterize the deep Web's content, quality, and relevance to information seekers.
- Discover automated means for identifying deep Web search sites and directing queries to them.
- Begin the process of educating the Internet-searching public about this heretofore hidden and valuable information storehouse.
A. What Has Not Been Analyzed or Included in Results
 
This paper does not investigate non-Web sources of Internet content. This study also purposely ignores private intranet information hidden behind firewalls. Many large companies have internal document stores that exceed terabytes of information. Since access to this information is restricted, its scale cannot be defined nor can it be characterized. Also, while on average 44% of the "contents" of a typical Web document reside in HTML and other coded information (for example, XML or JavaScript) [12], this study does not evaluate specific information within that code. We do, however, include those codes in our quantification of total content. Finally, the estimates for the size of the deep Web include neither specialized search engine sources -- which may be partially "hidden" to the major traditional search engines -- nor the contents of major search engines themselves. This latter category is significant. Simply accounting for the three largest search engines and average Web document sizes suggests search-engine contents alone may equal 25 terabytes or more [13], or somewhat larger than the known size of the surface Web.
B. A Common Denominator for Size Comparisons
 
All deep-Web and surface-Web size figures use both total number of documents (or database records in the case of the deep Web) and total data storage. Data storage is based on "HTML included" Web-document size estimates [11]. This basis includes all HTML and related code information plus standard text content, exclusive of embedded images and standard HTTP "header" information. Use of this standard convention allows apples-to-apples size comparisons between the surface and deep Web. The HTML-included convention was chosen because:
- Most standard search engines that report document sizes do so on this same basis.
- When saving documents or Web pages directly from a browser, the file size byte count uses this convention.
All document sizes used in the comparisons use actual byte counts (1024 bytes per kilobyte), as illustrated in the short example below.
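A minimal illustration of this convention, assuming the raw markup of a page has already been retrieved; the sample string and the helper name are invented for the example.

def html_included_size(raw_html: str) -> int:
    # Byte count of the markup plus embedded code and visible text; embedded
    # images and HTTP headers are not part of the count.
    return len(raw_html.encode("utf-8"))

page = "<html><head><script>var x = 1;</script></head><body>Sample text</body></html>"
size_bytes = html_included_size(page)
size_kilobytes = size_bytes / 1024  # the comparisons use 1024 bytes per kilobyte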
 
IV. ANALYSIS OF LARGEST DEEP WEB SITES
Site characterization required three steps:
- Estimating the total number of records or documents contained on that site.
- Retrieving a random sample of a minimum of ten results from each site and then computing the expressed HTML-included mean document size in bytes. This figure, times the number of total site records, produces the total site size estimate in bytes (see the sketch below).
- Indexing and characterizing the search-page form on the site to determine subject coverage.
Estimating total record count per site was often not straightforward. A series of tests was applied to each site; they are listed in descending order of importance and confidence in deriving the total document count:
- E-mail messages were sent to the webmasters or contacts listed for all sites identified, requesting verification of total record counts and storage sizes (uncompressed basis); about 13% of the sites provided direct documentation in response to this request.
- Total record counts as reported by the site itself. This involved inspecting related pages on the site, including help sections, site FAQs, etc.
- Documented site sizes presented at conferences, estimated by others, etc. This step involved comprehensive Web searching to identify reference sources.
- Record counts as provided by the site's own search function. Some site searches provide total record counts for all queries submitted. For others that use the NOT operator and allow its stand-alone use, a query term known not to occur on the site such as "NOT ddfhrwxxct" was issued. This approach returns an absolute total record count. Failing these two options, a broad query was issued that would capture the general site content; this number was then corrected for an empirically determined "coverage factor," generally in the 1.2 to 1.4 range [14].
- A site that failed all of these tests could not be measured and was dropped from the results listing.
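The size calculation referenced in the second step above can be sketched as follows. The function name, the sample sizes, and the record count are hypothetical; the coverage-factor argument reflects the empirically determined 1.2 to 1.4 range mentioned in the last test.

def estimate_site_size_bytes(sample_sizes_bytes, total_records, coverage_factor=1.0):
    # Mean HTML-included size of the sampled documents, times the (optionally
    # coverage-corrected) total record count, gives the site size in bytes.
    if len(sample_sizes_bytes) < 10:
        raise ValueError("the procedure calls for a minimum of ten sampled documents")
    mean_doc_size = sum(sample_sizes_bytes) / len(sample_sizes_bytes)
    return mean_doc_size * total_records * coverage_factor

# Hypothetical site: ten sampled documents averaging roughly 25 KB, and a record
# count of 1,200,000 derived from a broad query, corrected by a factor of 1.3.
samples = [24_000, 26_500, 23_800, 27_200, 25_100, 24_900, 26_000, 25_400, 23_600, 25_500]
total_bytes = estimate_site_size_bytes(samples, total_records=1_200_000, coverage_factor=1.3)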
 
V. ANALYSIS OF STANDARD DEEP WEB SITES
Analysis and characterization of the entire deep Web involved a number of discrete tasks:
- Estimation of the total number of deep Web sites.
- Deep Web size analysis.
- Content and coverage analysis.
- Site page views and link references.
- Growth analysis.
- Quality analysis.
A. Estimation of Total Number of Sites
The basic technique for estimating total deep Web sites uses "overlap" analysis, the accepted technique chosen for two
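Overlap analysis is commonly formulated as a pairwise capture-recapture estimate: if two independent sources list n_a and n_b sites with n_0 sites in common, the total population is estimated as (n_a * n_b) / n_0. The sketch below illustrates that common formulation only; both the formulation and the input figures are assumptions for the example, not values or procedures taken from this study.

def overlap_estimate(n_a, n_b, n_overlap):
    # Pairwise capture-recapture: two partial listings and their shared entries
    # bound the size of the underlying population.
    if n_overlap == 0:
        raise ValueError("with no shared listings the total cannot be estimated")
    return (n_a * n_b) / n_overlap

# Hypothetical example: two directories listing 60,000 and 45,000 deep Web
# sites with 13,500 entries in common suggest roughly 200,000 sites in total.
total_sites = overlap_estimate(60_000, 45_000, 13_500)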