• Web crawlers got a new look and many designs changed, and the term "spider" began to be used for them (September 4, 1995).
• By 2008 a web crawl consisted of about 2.5 billion pages, with a total of 19 terabytes of data.
• Nowadays different types of web crawling techniques are in use, such as focused crawling and intelligent crawling.

After 1995, web crawlers were recognized; they kept evolving and acquiring different names according to their functionality and changing behavior. They got different designs, sponsors, companies and different ways of searching the related links and data [9].

B. Why a Crawler is Needed?

These days the web is absolutely dynamic content, and information is created when it is needed. Web pages update repeatedly, so the resulting complications and the changing web environment have forced crawlers to evolve. In the beginning, crawling a subset of the documents on the web with a modified breadth-first traversal was sufficient. But more recently the scale, complexity, and dynamic nature of the Web have changed many of the requirements, making the crawling process more difficult. That is why articulated crawlers were needed to cope with the problem.

C. Behavior and Working of a Crawler

The behavior of a web crawler is the outcome of a combination of the following policies: the Selection policy states which pages to download, the Re-visit policy states when to check for changes to the pages, the Politeness policy states how to avoid overloading web sites, and the Parallelization policy states how to coordinate distributed web crawlers.

The basic algorithm on which a web crawler operates is quite simple. From a set of candidate URLs, the web crawler selects one URL, downloads the web pages associated with that URL, extracts the URLs contained in those pages and then adds to the candidate set the URLs that have not been encountered yet [2]. Once it has been determined that a URL has not been previously discovered, it is added to the frontier set containing the URLs that have yet to be downloaded. The frontier set is generally very large, so the URLs must be crawled in a way that maximizes the utility of the crawled amount. A high-quality, highly-demanded and fast-changing page is re-crawled frequently, while high-quality but slow-changing and fast-changing but low-quality pages receive a lower priority. Implementing a web crawler on a large scale can be quite complex. At large scale the formerly discussed policies are built into the crawler, and the crawler also follows the robots.txt standard while crawling pages. The Politeness policy controls the amount of traffic given to any particular web server [2].
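To make the preceding description concrete, the following is a minimal sketch of this basic crawl loop in Python. The function names, the plain FIFO frontier and the regular-expression link extraction are illustrative assumptions, not part of any crawler described in this paper.

```python
# Minimal breadth-first crawler sketch: select a URL from the frontier,
# download it, extract links, and enqueue only the links not seen before.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)          # URLs that have yet to be downloaded
    seen = set(seed_urls)                # URLs already discovered
    while frontier and max_pages > 0:
        url = frontier.popleft()         # Selection policy: plain FIFO here
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                     # skip pages that fail to download
        max_pages -= 1
        for link in HREF_RE.findall(html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)       # mark as discovered
                frontier.append(absolute)
        yield url, html                  # hand the page to an indexer

if __name__ == "__main__":
    for page_url, _ in crawl(["https://example.com/"]):
        print("fetched", page_url)
```

A production crawler would replace the FIFO frontier with a priority order implementing the re-visit, politeness and parallelization policies discussed above.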
Web crawlers have basically two ways of crawling:

1) Focused Crawling: selectively seeking out pages that are relevant to a pre-defined set of topics, emphasizing relevant links and skipping irrelevant links [6][8].

2) Intelligent Crawling: learning characteristics of the linkage structure of the World Wide Web while performing the crawl; no specific model of the web linkage structure is assumed [9].

D. Implementation of a Typical Crawler

After knowing the behavior and working of a crawler along with its crawling types, we need to know the basic components of a crawler in order to move to the implementation of a typical crawler. A crawler contains two basic components: the Crawling Application, which handles requests and issues URLs according to the request, and the Crawling System, which downloads pages and supplies them to the crawling application [10][6].

For the development of a basic crawler, two crawlers were studied, illustrating the evolution from a basic crawler to an improved one. This study was intended to give an idea of the architecture of an elementary crawler with fundamental features [11].

1) The First Crawler: The design of the first crawler is elaborated in Fig. 1. The first crawler had four fundamental parts: a database to store state, a set of agents to retrieve pages from the Web, a dispatcher to coordinate the agents and the database, and an indexer to update the full-text index with newly retrieved documents. The basic first web crawler has only one policy, the collection policy.

The Collection Policy of the first web crawler determines which URLs are to be collected and submitted for indexing, and which are not. The collection policy was made to distinguish related documents from unrelated ones and to make search more efficient by providing indexing in a hypertext domain while avoiding the creation of a larger index [1]. The Web Crawler's algorithm was in fact a simple breadth-first traversal, done at the level of the server table, not the document table.

Fig. 1. Design of the First Crawler

Traversal at the server level was chosen because documents on a particular server are likely to be related, with structured, easy-to-follow links among them. So, getting a few representative documents from each server is a straightforward way to achieve the Web Crawler's collection policy. The other benefit is that visiting servers in a breadth-first fashion automatically creates delays between subsequent visits to a single server; these delays are appreciated by server administrators [1].
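As an illustration of this server-level, breadth-first traversal, the sketch below groups the frontier by host and visits the hosts in round-robin order, which naturally spaces out requests to any single server. The data structures and function names are assumptions made for illustration, not the original Web Crawler's code.

```python
# Illustrative server-level breadth-first traversal: the frontier is a queue of
# hosts, each holding its own queue of URLs, so only one URL per host is taken
# per round and successive requests to a single server are naturally spread out.
from collections import deque, defaultdict
from urllib.parse import urlparse

def server_level_order(urls):
    per_host = defaultdict(deque)
    for url in urls:
        per_host[urlparse(url).netloc].append(url)
    hosts = deque(per_host)                 # breadth-first over servers
    while hosts:
        host = hosts.popleft()
        yield per_host[host].popleft()      # one representative URL per visit
        if per_host[host]:
            hosts.append(host)              # revisit this host later

if __name__ == "__main__":
    sample = ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
    print(list(server_level_order(sample)))
    # -> ['http://a.example/1', 'http://b.example/1', 'http://a.example/2']
```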
a) Problems and Challenges of the First Web Crawler: This first crawler worked fine for six months. But, as the size of the web increased, the implementation faced three major problems: fault tolerance, scale and politeness. The most severe of these was fault tolerance. Scale also became an issue with the growth of the web. The third problem with the early crawler was that, although it was relatively polite to individual servers, it did not obey the Standard for Robot Exclusion [1].

b) Modification to relieve the Problems: Keeping in mind the fault tolerance problem, the first modification made to the first crawler was not to index in real time while crawling, but just to put the raw HTML into the file system as it was retrieved from the web. Thus the indexing and crawling processes were separated, which prevented failures in either one from affecting the stability of the other. But later experience with this modified crawler also showed problems with fault tolerance, especially as the database grew larger than before. On the other hand, scale and politeness still remained a problem [1].

2) The Second Crawler: The second crawler was designed to evade some of the problems encountered with the first crawler; the issues of fault tolerance, scale, and flexibility were particularly addressed. The architecture of this new system was largely based on the old one, but with a few important differences, shown in Fig. 2 [1]. This crawler still had a database at its center; however, this time the database was implemented using a commercial, full-scale database, Oracle. The database was encircled by a number of key processes. The main dispatcher process retrieved pages to index from the database and passed them out to a series of agents. These agents retrieved pages from the web and passed their results back to an updater process. The updater was responsible for queueing the pages for indexing and for updating the database with the new metadata from the pages' retrieval.

Fig. 2. Design of the Second Crawler

A new strategy for the collection was adopted: many documents would be crawled, metadata and link information would also be obtained for them, and that data would be used to select the best documents for the collection [1].

E. Requirements to be a Good Crawler

A good crawler must fulfill these requirements: flexibility, meaning the system can be used in a variety of scenarios; low cost and high performance [7]; robustness, being able to tolerate crashes and network interruptions without losing too much data; and speed control, not putting too much load on a single server while downloading. It should also be manageable and reconfigurable [10].

F. How Crawlers are Categorized?

While studying the typical crawler implementation, two categories of crawlers emerged: Distributed and Agile. We present the design of both categories separately.

1) Distributed Web Crawler: A distributed web crawler is designed such that it scales to hundreds of pages per second. Fig. 3 elaborates the design of a distributed web crawler. It consists of a crawl manager, a crawling application, a Domain Name Server (DNS) resolver and a downloader.

The crawl manager is the basic component of the distributed crawler and is visible to all components. It receives requests for URLs and, after enqueueing them, loads the matching file to start the download. After loading the URLs, the manager requests the IP addresses of the servers from the DNS resolver and then requests the robots.txt file in the web server's root directory. Finally, after parsing the robots file and removing excluded URLs, the requested URLs are sent in batches to the downloader.
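The sketch below illustrates the manager steps just described: resolving the server's IP address, fetching and parsing robots.txt, dropping excluded URLs, and handing the survivors to a downloader in batches. It is a minimal single-process illustration, not the distributed implementation referenced in [10]; the function name, batch size and user-agent string are hypothetical.

```python
# Illustrative crawl-manager pipeline: DNS lookup, robots.txt filtering,
# and batching of the surviving URLs for a downloader.
import socket
from itertools import islice
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def prepare_batches(urls, batch_size=100, agent="ExampleBot"):
    robots_cache = {}                              # one robots.txt parser per host
    allowed = []
    for url in urls:
        host = urlparse(url).netloc
        if host not in robots_cache:
            try:
                socket.gethostbyname(host)         # DNS resolution step
            except socket.gaierror:
                continue                           # unresolvable host, skip URL
            parser = RobotFileParser(f"http://{host}/robots.txt")
            try:
                parser.read()                      # fetch and parse robots.txt
            except OSError:
                parser = None                      # treat as "no robots file"
            robots_cache[host] = parser
        parser = robots_cache[host]
        if parser is None or parser.can_fetch(agent, url):
            allowed.append(url)                    # excluded URLs are dropped
    urls_left = iter(allowed)
    while True:
        batch = list(islice(urls_left, batch_size))  # hand URLs over in batches
        if not batch:
            break
        yield batch
```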
Fig. 3. Design of the Distributed Crawler

The downloader, whose speed is adjustable, receives more than hundreds of pages per second, and a large number of pages have to be written out. The crawling application checks the downloaded pages for hyperlinks and, if they have not been visited already, they are sent to the manager in the form of batches. The performance, parsing and network speed of the distributed web crawler can be scaled up by adding additional low-cost components [10].
2) Agile Web Crawler: For agility of a web crawler, a single database is used (MySQL). Fig. 4 is the functional diagram of the crawling cycle, which shows that when the web crawler gets a URL to read, it saves a cookie and asks for the source code of that URL. After reading the source code, if there is an error it shows the error message and saves the status of the process; if the source is read correctly, it extracts the data, checks its database limit, saves the link, saves the source code and then terminates the process while saving the status. In the database the links are stored on the basis of the page's popularity ranking: the most visited on top, then the second most visited, and so on. An agile web crawler is easy to modify at a crucial level, so the system can be extended quickly without losing time on modifying the entire application. Agility of a web crawler allows robustness and speed, manageability, good structural analysis, correct data extraction and a fast search system [15].
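A minimal sketch of this crawl cycle is given below. The table layout, function names and error handling are illustrative only and are not taken from [15]; sqlite3 stands in for the single MySQL database so that the sketch is self-contained.

```python
# Illustrative agile crawl cycle: fetch a URL, record any error status,
# otherwise extract the source, store the link and page, and save the status.
import sqlite3
from urllib.request import urlopen

def crawl_once(db, url):
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, source TEXT, status TEXT)")
    try:
        source = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError as err:
        print("error:", err)                                   # show the error message
        db.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, "", f"error: {err}"))
    else:
        db.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, source, "ok"))
    db.commit()                                                # save the status and terminate

if __name__ == "__main__":
    connection = sqlite3.connect("crawl.db")
    crawl_once(connection, "https://example.com/")
```

Ordering the stored links by popularity, as described above, would then be a matter of keeping a visit counter per link and sorting on it when the links are read back.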
A comparison among crawler-like software packages, most of which are open source or have documentation available, was drawn in the form of tables after going through the details of the crawlers. The comparison criteria include:

• Searching Mechanism: the indexing and ranking methods are considered as the searching mechanism. Table 1 at [22] sums up the comparison drawn among the crawler-like software packages based on the searching mechanism.

• Crawler and Indexer Features: the functionalities of the built-in web crawlers and indexers, such as Robot Exclusion Standard support, crawler retrieval depth control, duplicate detection, file formats to be indexed and indexing of protected servers. Table 2 at [22] compares the crawlers mentioned above based upon crawler and indexer features.

• Searching Features, considered from ten aspects: Boolean search, phrase matching, attribute search, fuzzy search, word forms, wild cards, regular expressions, numeric data search, case sensitivity and natural language queries. Table 3 at [22] shows the comparison among the crawler software based on searching features.

• Other Features: international language support, page limit, and customizable result formatting. Table 4 at [22] contains the comparison and contrast based on the formerly mentioned attributes.

H. Security Detection Tools

During our research we found different tools which ensure application security, but these are desktop based or simple plug-ins and do not fit the idea we propose. These security tools include Exploit Me, Watcher, N-Stalker Free Version, Netsparker Community Edition, Web Security, Wapiti, SkipFish, Scrawl, X5s, WebScarab and Acunetix [17]. These tools are ranked in Table 5 at [22] based upon the supported operating systems, their functionalities and their types.
Our work covers the study of security holes at the implementation level of web applications, the study of programs related to our work, the implementation of a basic crawler, and the proposal of the design and architecture of the prototype intended to detect the security holes.

Right after the motivation and proposing a research plan, we came across the question: "How is a web crawler supposed to secure online assets?"

The question drew our attention because today the web is dynamic content; information is created at the same time as it is needed, i.e. the resources are not readily available to the users. Then how is it possible that a web crawler finds a resource that is either protected by a session or hidden behind an authentication form?

So, the punch line is to find "what are the security holes that allow a web crawler to intrude and fetch the location of a specific resource from a hidden database?" or "what features are included in the crawlers that enable them to find hidden resources?" All these questions led us to study the general problems of web applications regarding security.

III. STUDY ON SECURITY HOLES AND VULNERABILITIES IN WEB APPLICATIONS

Web application development is very different from other environments. The web browser and the nature of HTTP pose security pitfalls not found in traditional client-server applications. Security flaws in web applications can easily bypass firewalls and other basic security measures, allowing intruders and crawlers to access the application as an unauthorized entity. Here is a brief description of the problem areas and security holes which relate to our part of the research.

A. Detected Problem Areas and Security Holes of Web Applications

Inquiries were made from both the crawler and the web application perspective, to get both sides of the picture. The study conducted to get an overview of the vulnerabilities [16][17][18] which make web applications insecure gave insight into problems such as: developers overlook the security aspects, malicious HTML tags, user input is not sanitized, every file is placed in the HTML directory, relying on professed hidden fields, "GET" methods are used, inputs are not revalidated on the server side, data is not encapsulated, absolute paths are not used, file opening modes are not specified, error detection logs are not created, crawling of content disallowed in robots.txt, broken authentication, authorization problems [19], skipping configuration management, insecure secrets, improper session handling, design imperfections, poor cryptography [18], remote administration flaws, poor programming approaches, security attacks addressed only at the network level, and delinquency in vulnerability analysis [18][19][20].

B. Targeted Problems and their Rationales

From the list of problems elaborated so far, the specific ones targeted in our study from the crawler's point of view are discussed here.

1) ROBOTS.txt: Robots.txt [21] is a file placed in a web site's root (it serves a purpose similar to the robots "meta" tag of web pages) that allows or disallows bots to traverse specific directories. The commercial bots, such as the Google bot, Microsoft's bot and other bots which follow the legal rules, ethically obey the robots.txt. It is entirely up to the developer whether he programs his bot to follow the rules or not. The bad bots usually skip the robots.txt and crawl wherever they want. More advanced "bad bots" check the robots.txt, read the "disallow" section and deliberately traverse those areas. The following problems may be seen in ethical crawling (a minimal robots.txt check is sketched after this list):

• The crawler checks the robots.txt file and omits the directories/sections that are disallowed. There can be some links, intentionally or unintentionally provided in the source, that enable a bot to go through that path and unknowingly traverse the disallowed sections.

• There are bots that go deep into the site's root using a "hit-and-trial" method to guess expected directories like \cgi-bin\ and fetch the applications placed there [20].
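The following sketch shows how an ethical crawler can consult robots.txt before fetching a URL, using Python's standard urllib.robotparser; the user-agent string and URLs are placeholders.

```python
# Check robots.txt before fetching: an ethical crawler skips URLs that the
# site's robots.txt disallows for its user agent.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="ExampleBot"):
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = RobotFileParser(root + "/robots.txt")
    parser.read()                        # download and parse the robots.txt file
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for candidate in ["https://example.com/", "https://example.com/cgi-bin/admin"]:
        print(candidate, "->", "fetch" if is_allowed(candidate) else "skip")
```

As noted above, nothing forces a crawler to perform this check; whether it is done is entirely the bot developer's choice.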
Robots.txt implicitly demands that a number of problems be addressed, which are elaborated in the problem area. The most prominent of the problems jotted down include:

a) Sanitization of user input: to accept only valid input from the user.

b) Maintenance of an event log: to keep track of attacks; server-level rechecking of user input, to ensure that the data is secure and passes the security checks; and security checks at the application level, because if a crawler bypasses robots.txt the application itself still has security checks, implemented in the form of access rights, to avoid vulnerabilities. A progressive elaboration may include more problems to be targeted [20]. A minimal sketch of the first two checks follows below.
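As an illustration of items a) and b), the sketch below validates user input against a whitelist on the server side and writes every rejected request to an event log. The validation rule, log file name and function names are hypothetical.

```python
# Server-side input sanitization with an event log of rejected (possibly
# malicious) input. The whitelist pattern and log file are illustrative only.
import logging
import re

ALLOWED = re.compile(r"^[A-Za-z0-9 _.-]{1,64}$")      # whitelist of harmless characters

logging.basicConfig(filename="events.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def sanitize(field, value):
    """Return the value if it passes the whitelist, otherwise log and reject it."""
    if ALLOWED.fullmatch(value):
        return value
    logging.warning("rejected input for %r: %r", field, value)   # event log entry
    raise ValueError(f"invalid value for {field}")

if __name__ == "__main__":
    print(sanitize("username", "alice_01"))            # accepted
    try:
        sanitize("username", "<script>alert(1)</script>")
    except ValueError as err:
        print(err)                                     # rejected and logged
```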
IV. SOLUTIONS TO THE TARGETED PROBLEMS

The problems we intend to target can be uprooted by sanitizing input, using a centralized approach for input validation, performing input validation on the server side, addressing hidden fields and the broken authentication problem [19] by maintaining an event log, and addressing the authorization problem, data security and cryptography [16][20].

V. OTHER TECHNIQUES TO AVOID THE TARGETED PROBLEMS

Solutions do not have to be implemented only in the core of a web application to ensure security; adhering to security principles during its development and performing various types of testing can also help achieve the purpose. Tainted Model, Black-Box Testing, White-Box Testing, Penetration Testing, Improved Tainted Model, and Proactive and Reactive Strategies are the testing techniques, and the security principles are to compartmentalize, use least privilege, apply defense in depth, not trust user input, fail securely, secure the weakest link, and create secure defaults [16][20].
VI. PROPOSED DESIGN AND ARCHITECTURE

In addition to the problems and vulnerabilities exposed, a need was identified to put some new components into the radical design as a solution. Fig. 5 describes the architecture used to develop a throw-away prototype, The RebelAnt, with the highlighted areas as the suggested solutions to be incorporated. The proposed prototype contained a URL locator, settings controller, controller controls, scheduler, ready queue, suspended queue, URL store, controller, policy comparator and resource handler. The policy comparator and resource handler are the components added in the suggested architecture.
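Purely as an illustration of how the two added components might cooperate, the sketch below shows a policy comparator that vets each URL taken from the ready queue against robots.txt-style rules before a resource handler is allowed to fetch it. Every class, rule and queue here is a hypothetical stand-in, not the RebelAnt implementation.

```python
# Hypothetical sketch: a policy comparator gates URLs from the ready queue;
# disallowed URLs are parked in the suspended queue instead of being fetched.
from collections import deque
from urllib.parse import urlparse

class PolicyComparator:
    def __init__(self, disallowed_prefixes):
        self.disallowed = disallowed_prefixes            # e.g. paths from robots.txt

    def allows(self, url):
        path = urlparse(url).path
        return not any(path.startswith(p) for p in self.disallowed)

class ResourceHandler:
    def fetch(self, url):
        print("would fetch", url)                        # a real handler would download

def run(ready_queue, comparator, handler):
    suspended_queue = deque()
    while ready_queue:
        url = ready_queue.popleft()
        if comparator.allows(url):
            handler.fetch(url)
        else:
            suspended_queue.append(url)                  # flagged for later review
    return suspended_queue

if __name__ == "__main__":
    ready = deque(["https://example.com/index.html", "https://example.com/cgi-bin/secret"])
    print(run(ready, PolicyComparator(["/cgi-bin/"]), ResourceHandler()))
```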
[18] S. Kals, E. Kirda, C. Kruegel and N. Jovanovic, "SecuBat: a web vulnerability scanner", in Proc. 15th International Conference on World Wide Web, 2006.
[19] J. Symons, "Writing secure web applications", 2011.
[20] A. Mackman, M. Dunner, S. Vasireddy, R. Escamilla, J.D. Meier and A. Murukan. (2003, June). Microsoft Corporation. [Online]. Available: http://msdn.microsoft.com/en-us/library/ff649874.aspx
[21] "Robots.txt", (2012, January). [Online]. Available: http://www.robotstxt.org/robotstxt.html
[22] B. Arif, A.U. Nisa, Q. Shafi, H.N. Qureshi, U.H. Siddiqui and T. Tariq. (2012, June). [Online]. Available: https://drive.google.com/file/d/0BxqzIs337V6qemZBX25tbDFCVG8/edit?usp=sharing