(IJCSIS) International Journal of Computer Science and Information Security,Vol. 9, No. 6, June 2011
•
Characterize the deep Web's content, quality, andrelevance to information seekers.
•
Discover automated means for identifying deep Websearch sites and directing queries to them.
•
Begin the process of educating the Internet-searchingpublic about this heretofore hidden and valuableinformation storehouse
.
A.
What Has Not Been Analyzed or Included in Results
This paper does not investigate non-Web sources of Internet content. This study also purposely ignores privateintranet information hidden behind firewalls. Many largecompanies have internal document stores that exceed terabytesof information. Since access to this information is restricted,its scale can not be defined nor can it be characterized. Also,while on average 44% of the "contents" of a typical Webdocument reside in HTML and other coded information (forexample, XML or Java script),[12] this study does notevaluate specific information within that code. We do,however, include those codes in our quantification of totalcontent. Finally, the estimates for the size of the deep Webinclude neither specialized search engine sources -- which maybe partially "hidden" to the major traditional search engines –nor the contents of major search engines themselves. Thislatter category is significant. Simply accounting for the threelargest search engines and average Web document sizessuggests search-engine contents alone may equal 25 terabytesor more [13] or somewhat larger than the known size of thesurface Web.
B.
A Common Denominator for Size Comparisons
All deep-Web and surface-Web size figures use both totalnumber of documents (or database records in the case of thedeep Web) and total data storage. Data storage is based on"HTML included" Web-document size estimates.[11] Thisbasis includes all HTML and related code information plusstandard text content, exclusive of embedded images andstandard HTTP "header" information. Use of this standardconvention allows apples-to-apples size comparisons betweenthe surface and deep Web. The HTML-included conventionwas chosen because:
•
Most standard search engines that report documentsizes do so on this same basis.
•
When saving documents or Web pages directly froma browser, the file size byte count uses thisconvention.All document sizes used in the comparisons use actual bytecounts (1024 bytes per kilobyte)IV.
A
NALYSIS OF
L
ARGEST
D
EEP
W
EB
S
ITES
Site characterization required three steps:
•
Estimating the total number of records or documentscontained on that site.
•
Retrieving a random sample of a minimum of tenresults from each site and then computing theexpressed HTML-included mean document size inbytes. This figure, times the number of total siterecords, produces the total site size estimate in bytes.
•
Indexing and characterizing the search-page form onthe site to determine subject coverage.Estimating total record count per site was often notstraightforward. A series of tests was applied to each site andare listed in descending order of importance and confidence inderiving the total document count:
•
E-mail messages were sent to the webmasters orcontacts listed for all sites identified, requestingverification of total record counts and storage sizes(uncompressed basis); about 13% of the sitesprovided direct documentation in response to thisrequest.
•
Total record counts as reported by the site itself. Thisinvolved inspecting related pages on the site,including help sections, site FAQs, etc.
•
Documented site sizes presented at conferences,estimated by others, etc. This step involvedcomprehensive Web searching to identify referencesources.
•
Record counts as provided by the site's own searchfunction. Some site searches provide total recordcounts for all queries submitted. For others that usethe NOT operator and allow its stand-alone use, aquery term known not to occur on the site such as"NOT ddfhrwxxct" was issued. This approach returnsan absolute total record count. Failing these twooptions, a broad query was issued that would capturethe general site content; this number was thencorrected for an empirically determined "coveragefactor," generally in the 1.2 to 1.4 range [14].
•
A site that failed all of these tests could not bemeasured and was dropped from the results listing.V.
A
NALYSIS OF
S
TANDARD
D
EEP
W
EB
S
ITES
Analysis and characterization of the entire deep Webinvolved a number of discrete tasks:
•
Estimation of total number of deep Web sites.
•
Deep web Size analysis.
•
Content and coverage analysis.
•
Site page views and link references.
•
Growth analysis.
•
Quality analysis.
A.
Estimation of Total Number of Sites
The basic technique for estimating total deep Web sitesuses "overlap" analysis, the accepted technique chosen for two
121http://sites.google.com/site/ijcsis/ISSN 1947-5500