Reaction Paper: A Scale for Crawler Effectiveness on the Client-Side Hidden Web


Published by Benj Arriola
A reaction paper submitted as part of the requirements of the University of Redlands MBA with Emphasis on Information Systems, Information Systems Strategy Capstone Course (ISYS683W)

Published by: Benj Arriola on Jul 08, 2012
A Scale for Crawler Effectiveness on the Client-Side Hidden Web
By: Benj Arriola
Article Report
July 8, 2012
MBA ISYS683W
University of Redlands
Information Systems Strategy Capstone
Prof. Mark Gruber
7/8/2012 Text Search Information Retrieval
Article Report ISYS683W
University of Redlands
A Scale for Crawler Effectiveness on the Client-Side Hidden Web
This report is a review of the academic paper of the same title, A Scale for Crawler Effectiveness on the Client-Side Hidden Web. The research was conducted by professors of the Communications and Information Technologies Department at the University of A Coruña in Spain and was published in 2012 in the Computer Science and Information Systems journal.
The paper compares technologies, mainly free and commercial web crawler software platforms, to test their effectiveness in crawling the hidden web. The paper is academic in nature and, like many science journal articles, it does not discuss the practical or business application of the research. It is written for an academic audience, and readers are assumed to already know how these technologies are applied.
To give a better understanding of the paper, this report first defines what a crawler is and what the client-side hidden web is. It then discusses the business application that the paper did not tackle, and the limitations that may give false notions to the average reader. Below is an outline of the flow of this report:
Web Crawlers
Client-Side Hidden Web
Business Application Significance of this Study
The Academic Paper by the Professors of University of A Coruña
The Conducted Experiments
The Results of the Research Paper
Conclusions of the Research Paper
Possible Wrong Deductions by Readers of the Paper
Inferior Crawlers are Inferior for a Reason
Not Crawling AJAX and Flash links
Lack of Research of Crawling Technologies
Crawling and Information Retrieval are Two Different Things
Lack of Knowledge of Google, Bing and Yahoo
Search Engine Robots & IP Addresses
Redirection Handling
Report Conclusion
Prieto, V. M., Alvarez, M., Lopez-Garcia, R., Cacheda, F., University of A Coruña. A Scale for Crawler Effectiveness on the Client-Side Hidden Web. Computer Science and Information Systems, Vol. 9, No. 2, 561-583. (2012) ComSIS Consortium
What is a Crawler
Crawlers are simply software programs that visit pages through their URLs. The program crawls, or searches, within each page for other URLs to crawl and analyze, and repeats this until all reachable pages have been exhaustively crawled. Some crawlers are limited to crawling HTML pages alone, while others also crawl other page assets such as images, videos, CSS files, JavaScript files and more. Crawlers are also called spiders, robots, or simply bots.
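The crawl loop described above can be illustrated with a minimal Python sketch. It is only an illustration, not the implementation of any crawler evaluated in the paper. The names LinkExtractor, crawl and SITE, and the example.test URLs, are assumptions made for this example, and the pages are held in memory rather than fetched over HTTP so that the sketch runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start_url}          # URLs already queued, to avoid revisiting
    queue = deque([start_url])  # URLs waiting to be visited
    order = []                  # pages successfully crawled, in visit order
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: nothing to parse
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# Hypothetical three-page site held in memory (assumption for this sketch).
SITE = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="/">home</a> <a href="/b.html#top">B</a>',
    "http://example.test/b.html": '<a href="/missing.html">gone</a>',
}

visited = crawl("http://example.test/", SITE.get)
print(visited)
```

The crawl stops on its own once the queue is empty, which is the "exhaustively crawled" condition described above. A crawler for the live web would replace SITE.get with a function that downloads each page, and would normally add politeness rules such as honoring robots.txt.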
What is the Client-Side Hidden Web
For every URL loaded in a web browser, the page can be created in real time on the server, which runs server-side technologies. The same URL can also load elements that change the appearance or content of the webpage within the web browser itself, and these are client-side technologies. Because of the number of technologies that make up a webpage, not all information is readily “crawlable.” Crawlers are not necessarily client devices or web browsers; they are software scripts trying to decipher code, mainly HTML. Current web technologies such as JavaScript, Adobe Flash, Adobe Shockwave, Apple QuickTime, RealNetworks RealPlayer, AJAX, XML, and several other less popular client-side technologies make it difficult for crawlers to gather all the data a webpage may offer.
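A small Python sketch can show why client-side code hides content from a crawler that only reads HTML. The page below is a made-up example, not one of the test pages used in the paper, and the name StaticLinkExtractor is an assumption for this illustration. The page contains one plain link, one link that only exists inside an onclick handler, and one link that JavaScript writes into the page when a browser runs it.

```python
from html.parser import HTMLParser


class StaticLinkExtractor(HTMLParser):
    """Reads href attributes of <a> tags without executing any script."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page mixing a static link with two client-side links.
PAGE = """
<html><body>
<a href="/plain.html">Plain link</a>
<a href="#" onclick="window.location='/onclick.html'">Scripted link</a>
<script>
  document.write('<a href="/written' + '.html">Generated link<' + '/a>');
</script>
</body></html>
"""

parser = StaticLinkExtractor()
parser.feed(PAGE)
print(parser.links)
```

The extractor reports only /plain.html and the placeholder # of the scripted link. The destinations /onclick.html and /written.html stay hidden, because they exist only after a browser executes the JavaScript. Reaching them requires a crawler that interprets scripts, which is the kind of capability the paper sets out to measure.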
Business Application Significance of this Study
In the information age, more and more information is shared online, either publicly or privately, over the Internet. Common business software applications have been moving to the Internet as web-based applications built on cloud technologies, where the base interface is a web browser and the application may run on a large number of devices such as computers, mobile phones, tablets and others. Other applications also use the cloud, not through a web browser but through a custom application that accesses the data in the cloud, which eases the application's limitations in terms of allowed Internet protocols and port numbers.
With the greater utilization of the cloud, more data is stored on the Internet, most of it web-based. This makes it more important to make data searchable. To search data properly on the web, information must be appropriately saved and indexed, which can be achieved by crawling the pages. Using crawlers with the best capabilities for crawling the hidden web reduces the restrictions on the format or method of content creation.
The better the content is crawled, the more complete the indexed and searchable content becomes, which can help improve work efficiency in a cloud environment.

