• Web crawlers got a new look and many designs changed, and the term "spider" began to be used for them (September 4, 1995).
• By 2008 a web crawl consisted of about 2.5 billion pages, with a total of 19 terabytes of data.
• Nowadays different types of web crawling techniques are in use, such as focused crawling and intelligent crawling.

After 1995, web crawlers were recognized; they kept evolving and acquiring different names according to their functionality and changing behavior. They got different designs, sponsors, companies and different ways of searching the related links and data [9].

B. Why a Crawler is Needed?

These days the web is absolutely dynamic content, and information is created when it is needed. Web pages update repeatedly, so the resulting complications and the changing web environment have forced crawlers to evolve. In the beginning, crawling a subset of the documents on the web with a modified breadth-first traversal was sufficient. But more recently the scale, complexity, and dynamic nature of the Web have changed many of the requirements, making the crawling process more difficult. That is why articulated crawlers were needed to cope with the problem.

C. Behavior and Working of a Crawler

The behavior of a web crawler is the outcome of a combination of the following policies: the Selection policy states which pages to download, the Re-visit policy states when to check for changes to the pages, the Politeness policy states how to avoid overloading web sites, and the Parallelization policy states how to coordinate distributed web crawlers.

The basic algorithm on which a web crawler operates is quite simple. From a set of candidate URLs, the web crawler selects one URL, downloads the web pages associated with that URL, extracts the URLs contained in those pages and then adds to the candidate set the URLs that have not been encountered yet [2]. Once it has been determined that a URL has not been previously discovered, it is added to the frontier set containing the URLs that have yet to be downloaded. The frontier set is generally very large, so the URLs must be crawled in a way that maximizes the utility of the crawled amount. A high-quality, highly-demanded and fast-changing page is re-crawled frequently, while high-quality but slow-changing and fast-changing but low-quality pages receive a lower priority. Implementing a web crawler on a large scale can be quite complex. At large scale the formerly discussed policies are built into the crawler, and the crawler also follows the robots.txt standard while crawling pages. The Politeness policy controls the amount of traffic given to any particular web server [2].
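To make the preceding description concrete, the following is a minimal sketch of this basic crawl loop in Python. The function names, the plain FIFO frontier and the regular-expression link extraction are illustrative assumptions, not part of any crawler described in this paper.

```python
# Minimal breadth-first crawler sketch: select a URL from the frontier,
# download it, extract links, and enqueue only the links not seen before.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)          # URLs that have yet to be downloaded
    seen = set(seed_urls)                # URLs already discovered
    while frontier and max_pages > 0:
        url = frontier.popleft()         # Selection policy: plain FIFO here
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                     # skip pages that fail to download
        max_pages -= 1
        for link in HREF_RE.findall(html):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)       # mark as discovered
                frontier.append(absolute)
        yield url, html                  # hand the page to an indexer

if __name__ == "__main__":
    for page_url, _ in crawl(["https://example.com/"]):
        print("fetched", page_url)
```

A production crawler would replace the FIFO frontier with a priority order implementing the re-visit, politeness and parallelization policies discussed above.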
Web crawlers have basically two ways of crawling:

1) Focused Crawling: selectively seeking out pages that are relevant to a pre-defined set of topics, emphasizing relevant links and skipping irrelevant links [6][8].

2) Intelligent Crawling: learning characteristics of the linkage structure of the World Wide Web while performing the crawl; no specific model of the web linkage structure is assumed [9].

D. Implementation of a Typical Crawler

After knowing the behavior and working of a crawler along with its crawling types, we need to know the basic components of a crawler in order to move to the implementation of a typical crawler. A crawler contains two basic components: the Crawling Application, which handles requests and issues URLs according to the request, and the Crawling System, which downloads pages and supplies them to the crawling application [10][6].

For the development of a basic crawler, two crawlers were studied, illustrating the evolution from a basic crawler to an improved one. This study was intended to give an idea of the architecture of an elementary crawler with fundamental features [11].

1) The First Crawler: The design of the first crawler is elaborated in Fig. 1. The first crawler had four fundamental parts: a database to store state, a set of agents to retrieve pages from the Web, a dispatcher to coordinate the agents and the database, and an indexer to update the full-text index with newly retrieved documents. The basic first web crawler has only one policy, the collection policy.

The Collection Policy of the first web crawler determines which URLs are to be collected and submitted for indexing, and which are not. The collection policy was made to distinguish related documents from unrelated ones and to make search more efficient by providing indexing in a hypertext domain while avoiding the creation of a larger index [1]. The Web Crawler's algorithm was in fact a simple breadth-first traversal, done at the level of the server table, not the document table.

Fig. 1. Design of the First Crawler

Traversal at the server level was chosen because documents on a particular server are likely to be related, with structured, easy-to-follow links among them. So, getting a few representative documents from each server is a straightforward way to achieve the Web Crawler's collection policy. The other benefit is that visiting servers in a breadth-first fashion automatically creates delays between subsequent visits to a single server; these delays are appreciated by server administrators [1].
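As an illustration of this server-level, breadth-first traversal, the sketch below groups the frontier by host and visits the hosts in round-robin order, which naturally spaces out requests to any single server. The data structures and function names are assumptions made for illustration, not the original Web Crawler's code.

```python
# Illustrative server-level breadth-first traversal: the frontier is a queue of
# hosts, each holding its own queue of URLs, so only one URL per host is taken
# per round and successive requests to a single server are naturally spread out.
from collections import deque, defaultdict
from urllib.parse import urlparse

def server_level_order(urls):
    per_host = defaultdict(deque)
    for url in urls:
        per_host[urlparse(url).netloc].append(url)
    hosts = deque(per_host)                 # breadth-first over servers
    while hosts:
        host = hosts.popleft()
        yield per_host[host].popleft()      # one representative URL per visit
        if per_host[host]:
            hosts.append(host)              # revisit this host later

if __name__ == "__main__":
    sample = ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
    print(list(server_level_order(sample)))
    # -> ['http://a.example/1', 'http://b.example/1', 'http://a.example/2']
```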
a) Problems and Challenges of the First Web Crawler: This first crawler worked fine for six months. But, as the size of the web increased, the implementation faced three major problems: fault tolerance, scale and politeness. The most severe of these was fault tolerance. Scale also became an issue with the growth of the web. The third problem with the early crawler was that, although it was relatively polite to individual servers, it did not obey the Standard for Robot Exclusion [1].

b) Modification to relieve the Problems: Keeping in mind the fault tolerance problem, the first modification made to the first crawler was not to index in real time while crawling, but just to put the raw HTML into the file system as it was retrieved from the web. Thus the indexing and crawling processes were separated, which prevented failures in either one from affecting the stability of the other. But later experience with this modified crawler also showed problems with fault tolerance, especially as the database grew larger than before. On the other hand, scale and politeness still remained a problem [1].

2) The Second Crawler: The second crawler was designed to evade some of the problems encountered with the first crawler; the issues of fault tolerance, scale, and flexibility were particularly addressed. The architecture of this new system was largely based on the old one, but with a few important differences, shown in Fig. 2 [1]. This crawler still had a database at its center; however, this time the database was implemented using a commercial, full-scale database, Oracle. The database was encircled by a number of key processes. The main dispatcher process retrieved pages to index from the database and passed them out to a series of agents. These agents retrieved pages from the web and passed their results back to an updater process. The updater was responsible for queueing the pages for indexing and for updating the database with the new metadata from the pages' retrieval.

Fig. 2. Design of the Second Crawler

A new strategy for the collection was adopted: many documents would be crawled, metadata and link information would also be obtained for them, and that data would be used to select the best documents for the collection [1].

E. Requirements to be a Good Crawler

A good crawler must fulfill these requirements: flexibility, meaning the system can be used in a variety of scenarios; low cost and high performance [7]; robustness, being able to tolerate crashes and network interruptions without losing too much data; and speed control, not putting too much load on a single server while downloading. It should also be manageable and reconfigurable [10].

F. How Crawlers are Categorized?

While studying the typical crawler implementation, two categories of crawlers emerged: Distributed and Agile. We present the design of both categories separately.

1) Distributed Web Crawler: A distributed web crawler is designed such that it scales to hundreds of pages per second. Fig. 3 elaborates the design of a distributed web crawler. It consists of a crawl manager, a crawling application, a Domain Name Server (DNS) resolver and a downloader.

The crawl manager is the basic component of the distributed crawler and is visible to all components. It receives requests for URLs and, after enqueueing them, loads the matching file to start the download. After loading the URLs, the manager requests the IP addresses of the servers from the DNS resolver and then requests the robots.txt file in the web server's root directory. Finally, after parsing the robots file and removing excluded URLs, the requested URLs are sent in batches to the downloader.
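The sketch below illustrates the manager steps just described: resolving the server's IP address, fetching and parsing robots.txt, dropping excluded URLs, and handing the survivors to a downloader in batches. It is a minimal single-process illustration, not the distributed implementation referenced in [10]; the function name, batch size and user-agent string are hypothetical.

```python
# Illustrative crawl-manager pipeline: DNS lookup, robots.txt filtering,
# and batching of the surviving URLs for a downloader.
import socket
from itertools import islice
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def prepare_batches(urls, batch_size=100, agent="ExampleBot"):
    robots_cache = {}                              # one robots.txt parser per host
    allowed = []
    for url in urls:
        host = urlparse(url).netloc
        if host not in robots_cache:
            try:
                socket.gethostbyname(host)         # DNS resolution step
            except socket.gaierror:
                continue                           # unresolvable host, skip URL
            parser = RobotFileParser(f"http://{host}/robots.txt")
            try:
                parser.read()                      # fetch and parse robots.txt
            except OSError:
                parser = None                      # treat as "no robots file"
            robots_cache[host] = parser
        parser = robots_cache[host]
        if parser is None or parser.can_fetch(agent, url):
            allowed.append(url)                    # excluded URLs are dropped
    urls_left = iter(allowed)
    while True:
        batch = list(islice(urls_left, batch_size))  # hand URLs over in batches
        if not batch:
            break
        yield batch
```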
Fig. 3. Design of the Distributed Crawler

The downloader, whose speed is adjustable, receives more than hundreds of pages per second, and a large number of pages have to be written out. The crawling application checks the downloaded pages for hyperlinks and, if they have not been visited already, they are sent to the manager in the form of batches. The performance, parsing and network speed of the distributed web crawler can be scaled up by adding additional low-cost components [10].
2) Agile Web Crawler: For agility of a web crawler, a single database is used (MySQL). Fig. 4 is the functional diagram of the crawling cycle, which shows that when the web crawler gets a URL to read, it saves a cookie and asks for the source code of that URL. After reading the source code, if there is an error it shows the error message and saves the status of the process; if the source is read correctly, it extracts the data, checks its database limit, saves the link, saves the source code and then terminates the process while saving the status. In the database the links are stored on the basis of the page's popularity ranking: the most visited on top, then the second most visited, and so on. An agile web crawler is easy to modify at a crucial level, so the system can be extended quickly without losing time on modifying the entire application. Agility of a web crawler allows robustness and speed, manageability, good structural analysis, correct data extraction and a fast search system [15].
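A minimal sketch of this crawl cycle is given below. The table layout, function names and error handling are illustrative only and are not taken from [15]; sqlite3 stands in for the single MySQL database so that the sketch is self-contained.

```python
# Illustrative agile crawl cycle: fetch a URL, record any error status,
# otherwise extract the source, store the link and page, and save the status.
import sqlite3
from urllib.request import urlopen

def crawl_once(db, url):
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, source TEXT, status TEXT)")
    try:
        source = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except OSError as err:
        print("error:", err)                                   # show the error message
        db.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, "", f"error: {err}"))
    else:
        db.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, source, "ok"))
    db.commit()                                                # save the status and terminate

if __name__ == "__main__":
    connection = sqlite3.connect("crawl.db")
    crawl_once(connection, "https://example.com/")
```

Ordering the stored links by popularity, as described above, would then be a matter of keeping a visit counter per link and sorting on it when the links are read back.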
A comparison among crawler-like software packages, most of which are open source or have documentation available, was drawn in the form of tables after going through the details of the crawlers. The comparison criteria include:

• Searching Mechanism: the indexing and ranking methods are considered as the searching mechanism. Table 1 at [22] sums up the comparison drawn among the crawler-like software packages based on the searching mechanism.

• Crawler and Indexer Features: the functionalities of the built-in web crawlers and indexers, such as Robot Exclusion Standard support, crawler retrieval depth control, duplicate detection, file formats to be indexed and indexing of protected servers. Table 2 at [22] compares the crawlers mentioned above based upon crawler and indexer features.

• Searching Features, considered from ten aspects: Boolean search, phrase matching, attribute search, fuzzy search, word forms, wild cards, regular expressions, numeric data search, case sensitivity and natural language queries. Table 3 at [22] shows the comparison among the crawler software based on searching features.

• Other Features: international language support, page limit, and customizable result formatting. Table 4 at [22] contains the comparison and contrast based on the formerly mentioned attributes.

H. Security Detection Tools

During our research we found different tools which ensure application security, but these are desktop based or simple plug-ins and do not fit the idea we propose. These security tools include Exploit Me, Watcher, N-Stalker Free Version, Netsparker Community Edition, Web Security, Wapiti, SkipFish, Scrawl, X5s, WebScarab and Acunetix [17]. These tools are ranked in Table 5 at [22] based upon the supported operating systems, their functionalities and their types.
Our work covers the study of security holes at the implementation level of web applications, the study of programs related to our work, the implementation of a basic crawler, and the proposal of the design and architecture of the prototype intended to detect the security holes.

Right after the motivation and proposing a research plan, we came across the question: "How is a web crawler supposed to secure online assets?"

The question drew our attention because today the web is dynamic content; information is created at the same time as it is needed, i.e. the resources are not readily available to the users. Then how is it possible that a web crawler finds a resource that is either protected by a session or hidden behind an authentication form?

So, the punch line is to find "what are the security holes that allow a web crawler to intrude and fetch the location of a specific resource from a hidden database?" or "what features are included in the crawlers that enable them to find hidden resources?" All these questions led us to study the general problems of web applications regarding security.

III. STUDY ON SECURITY HOLES AND VULNERABILITIES IN WEB APPLICATIONS

Web application development is very different from other environments. The web browser and the nature of HTTP pose security pitfalls not found in traditional client-server applications. Security flaws in web applications can easily bypass firewalls and other basic security measures, allowing intruders and crawlers to access the application as an unauthorized entity. Here is a brief description of the problem areas and security holes which relate to our part of the research.

A. Detected Problem Areas and Security Holes of Web Applications

Inquiries were made from both the crawler and the web application perspective, to get both sides of the picture. The study conducted to get an overview of the vulnerabilities [16][17][18] which make web applications insecure gave insight into problems such as: developers overlook the security aspects, malicious HTML tags, user input is not sanitized, every file is placed in the HTML directory, relying on professed hidden fields, "GET" methods are used, inputs are not revalidated on the server side, data is not encapsulated, absolute paths are not used, file opening modes are not specified, error detection logs are not created, crawling of content disallowed in robots.txt, broken authentication, authorization problems [19], skipping configuration management, insecure secrets, improper session handling, design imperfections, poor cryptography [18], remote administration flaws, poor programming approaches, security attacks addressed only at the network level, and delinquency in vulnerability analysis [18][19][20].

B. Targeted Problems and their Rationales

From the list of problems elaborated so far, the specific ones targeted in our study from the crawler's point of view are discussed here.

1) ROBOTS.txt: Robots.txt [21] is a file placed in a web site's root (it serves a purpose similar to the robots "meta" tag of web pages) that allows or disallows bots to traverse specific directories. The commercial bots, such as the Google bot, Microsoft's bot and other bots which follow the legal rules, ethically obey the robots.txt. It is entirely up to the developer whether he programs his bot to follow the rules or not. The bad bots usually skip the robots.txt and crawl wherever they want. More advanced "bad bots" check the robots.txt, read the "disallow" section and deliberately traverse those areas. The following problems may be seen in ethical crawling (a minimal robots.txt check is sketched after this list):

• The crawler checks the robots.txt file and omits the directories/sections that are disallowed. There can be some links, intentionally or unintentionally provided in the source, that enable a bot to go through that path and unknowingly traverse the disallowed sections.

• There are bots that go deep into the site's root using a "hit-and-trial" method to guess expected directories like \cgi-bin\ and fetch the applications placed there [20].
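The following sketch shows how an ethical crawler can consult robots.txt before fetching a URL, using Python's standard urllib.robotparser; the user-agent string and URLs are placeholders.

```python
# Check robots.txt before fetching: an ethical crawler skips URLs that the
# site's robots.txt disallows for its user agent.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="ExampleBot"):
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = RobotFileParser(root + "/robots.txt")
    parser.read()                        # download and parse the robots.txt file
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for candidate in ["https://example.com/", "https://example.com/cgi-bin/admin"]:
        print(candidate, "->", "fetch" if is_allowed(candidate) else "skip")
```

As noted above, nothing forces a crawler to perform this check; whether it is done is entirely the bot developer's choice.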
Robots.txt implicitly demands that a number of problems be addressed, which are elaborated in the problem area. The most prominent of the problems jotted down include:

a) Sanitization of user input: to accept only valid input from the user.

b) Maintenance of an event log: to keep track of attacks; server-level rechecking of user input, to ensure that the data is secure and passes the security checks; and security checks at the application level, because if a crawler bypasses robots.txt the application itself still has security checks, implemented in the form of access rights, to avoid vulnerabilities. A progressive elaboration may include more problems to be targeted [20]. A minimal sketch of the first two checks follows below.
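As an illustration of items a) and b), the sketch below validates user input against a whitelist on the server side and writes every rejected request to an event log. The validation rule, log file name and function names are hypothetical.

```python
# Server-side input sanitization with an event log of rejected (possibly
# malicious) input. The whitelist pattern and log file are illustrative only.
import logging
import re

ALLOWED = re.compile(r"^[A-Za-z0-9 _.-]{1,64}$")      # whitelist of harmless characters

logging.basicConfig(filename="events.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def sanitize(field, value):
    """Return the value if it passes the whitelist, otherwise log and reject it."""
    if ALLOWED.fullmatch(value):
        return value
    logging.warning("rejected input for %r: %r", field, value)   # event log entry
    raise ValueError(f"invalid value for {field}")

if __name__ == "__main__":
    print(sanitize("username", "alice_01"))            # accepted
    try:
        sanitize("username", "<script>alert(1)</script>")
    except ValueError as err:
        print(err)                                     # rejected and logged
```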
IV. SOLUTIONS TO THE TARGETED PROBLEMS

The problems we intend to target can be uprooted by sanitizing input, using a centralized approach for input validation, performing input validation on the server side, addressing hidden fields and the broken authentication problem [19] by maintaining an event log, and addressing the authorization problem, data security and cryptography [16][20].

V. OTHER TECHNIQUES TO AVOID THE TARGETED PROBLEMS

Solutions do not have to be implemented only in the core of a web application to ensure security; adhering to security principles during its development and performing various types of testing can also help achieve the purpose. Tainted Model, Black-Box Testing, White-Box Testing, Penetration Testing, Improved Tainted Model, and Proactive and Reactive Strategies are the testing techniques, and the security principles are to compartmentalize, use least privilege, apply defense in depth, not trust user input, fail securely, secure the weakest link, and create secure defaults [16][20].
VI. PROPOSED DESIGN AND ARCHITECTURE

In addition to the problems and vulnerabilities exposed, a need was identified to put some new components into the radical design as a solution. Fig. 5 describes the architecture used to develop a throw-away prototype, The RebelAnt, with the highlighted areas as the suggested solutions to be incorporated. The proposed prototype contained a URL locator, settings controller, controller controls, scheduler, ready queue, suspended queue, URL store, controller, policy comparator and resource handler. The policy comparator and resource handler are the components added in the suggested architecture.
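Purely as an illustration of how the two added components might cooperate, the sketch below shows a policy comparator that vets each URL taken from the ready queue against robots.txt-style rules before a resource handler is allowed to fetch it. Every class, rule and queue here is a hypothetical stand-in, not the RebelAnt implementation.

```python
# Hypothetical sketch: a policy comparator gates URLs from the ready queue;
# disallowed URLs are parked in the suspended queue instead of being fetched.
from collections import deque
from urllib.parse import urlparse

class PolicyComparator:
    def __init__(self, disallowed_prefixes):
        self.disallowed = disallowed_prefixes            # e.g. paths from robots.txt

    def allows(self, url):
        path = urlparse(url).path
        return not any(path.startswith(p) for p in self.disallowed)

class ResourceHandler:
    def fetch(self, url):
        print("would fetch", url)                        # a real handler would download

def run(ready_queue, comparator, handler):
    suspended_queue = deque()
    while ready_queue:
        url = ready_queue.popleft()
        if comparator.allows(url):
            handler.fetch(url)
        else:
            suspended_queue.append(url)                  # flagged for later review
    return suspended_queue

if __name__ == "__main__":
    ready = deque(["https://example.com/index.html", "https://example.com/cgi-bin/secret"])
    print(run(ready, PolicyComparator(["/cgi-bin/"]), ResourceHandler()))
```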
[18] S. Kals, E. Kirda, C. Kruegel and N. Jovanovic, "SecuBat: a web vulnerability scanner", in Proc. 15th International Conference on World Wide Web, 2006.
[19] J. Symons, "Writing secure web applications", 2011.
[20] A. Mackman, M. Dunner, S. Vasireddy, R. Escamilla, J.D. Meier and A. Murukan. (2003, June). Microsoft Corporation. [Online]. Available: http://msdn.microsoft.com/en-us/library/ff649874.aspx
[21] "Robots.txt", (2012, January). [Online]. Available: http://www.robotstxt.org/robotstxt.html
[22] B. Arif, A.U. Nisa, Q. Shafi, H.N. Qureshi, U.H. Siddiqui and T. Tariq. (2012, June). [Online]. Available: https://drive.google.com/file/d/0BxqzIs337V6qemZBX25tbDFCVG8/edit?usp=sharing