
CONTENTS

LIST OF FIGURES IV

1 SYNOPSIS 1
1.1 PROJECT TITLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 PROJECT OPTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 INTERNAL GUIDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.4 SPONSORSHIP AND EXTERNAL GUIDE . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.5 TECHNICAL KEYWORDS (AS PER ACM KEYWORDS) . . . . . . . . . . . . . . . . . 1
1.6 PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.7 ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.8 GOALS AND OBJECTIVES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.9 Relevant mathematics associated with the Project . . . . . . . . . . . . . . . . . . . . . 3
1.10 REVIEW OF CONFERENCE/JOURNAL PAPERS SUPPORTING PROJECT IDEA . 5
1.11 Plan of project execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 TECHNICAL KEYWORDS 7
2.1 AREA OF PROJECT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 TECHNICAL KEYWORDS: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 INTRODUCTION 8
3.1 PROJECT IDEA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 LITERATURE SURVEY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4 PROBLEM DEFINITION AND SCOPE 11


4.1 PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.1 Goals and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1.2 Statement of scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 MAJOR CONSTRAINTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.3 METHODOLOGIES OF PROBLEM SOLVING AND EFFICIENCY ISSUES . . . . . . . 13
4.4 OUTCOME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.5 APPLICATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.6 HARDWARE RESOURCES REQUIRED . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.7 SOFTWARE RESOURCES REQUIRED . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

5 PROJECT PLAN 16
5.1 PROJECT ESTIMATES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.1.1 RECONCILED ESTIMATES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.2 Project Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.3 RISK MANAGEMENT W.R.T. NP-HARD ANALYSIS . . . . . . . . . . . . . . . . . . 18
5.3.1 Schedule Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.3.2 Budget Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.3.3 Operational Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.3.4 Technical Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.3.5 Project Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.4 PROJECT SCHEDULE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.5 TEAM ORGANIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.5.1 Management reporting and communication . . . . . . . . . . . . . . . . . . . . . . 21

6 SOFTWARE REQUIREMENT SPECIFICATION 23


6.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.1.1 Purpose and Scope of Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6.1.2 Overview of responsibilities of Developer . . . . . . . . . . . . . . . . . . . . . . . . 23
6.2 USAGE SCENARIO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.2.1 User profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.2.2 Use Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.3 Functional Model and Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.3.1 Data Flow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.3.2 Activity Diagram: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.3.3 Non Functional Requirements: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.3.4 State Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.3.5 Design Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.3.6 Software Interface Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

7 DETAILED DESIGN DOCUMENT USING APPENDIX A AND B 34


7.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.2 ARCHITECTURAL DESIGN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.3 COMPONENT DESIGN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.3.1 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36



8 PROJECT IMPLEMENTATION 38
8.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.2 TOOLS AND TECHNOLOGIES USED . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
8.3 METHODOLOGIES/ALGORITHM DETAILS . . . . . . . . . . . . . . . . . . . . . . . . 39
8.4 VERIFICATION AND VALIDATION FOR ACCEPTANCE . . . . . . . . . . . . . . . . 42

9 SOFTWARE TESTING 44
9.1 TYPE OF TESTING USED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
9.2 TEST CASES AND TEST RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

10 RESULTS 49
10.1 INPUT SCREEN SHOTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
10.2 OUTPUTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

11 DEPLOYMENT AND MAINTENANCE 54


11.1 INSTALLATION AND UN-INSTALLATION . . . . . . . . . . . . . . . . . . . . . . . . . 54
11.2 USER HELP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

12 CONCLUSION AND FUTURE SCOPE 59

13 62
List of Figures

1.1 Plan of project execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

6.1 Use case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


6.2 Use case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.3 Data Flow Diagram 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.4 DATA FLOW DIAGRAM 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.5 User Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.6 Admin Activity Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.7 State Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

7.1 Architectural Design Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36


7.2 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

13.1 Plan of Project Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


Chapter 1

SYNOPSIS

1.1 PROJECT TITLE

Smart Crawler: A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces

1.2 PROJECT OPTION

Internal Project

1.3 INTERNAL GUIDE

Prof. Bharati Gaikwad

1.4 SPONSORSHIP AND EXTERNAL GUIDE

1.5 TECHNICAL KEYWORDS (AS PER ACM KEYWORDS)

• Meta search

• Two-stage crawler

• Page Ranking

• Reverse searching

http://ijircce.com/current-issue.html


1.6 PROBLEM STATEMENT

We propose an effective deep-web harvesting framework, Smart Crawler, that achieves both wide coverage
and high efficiency for a focused crawler. Based on the observation that deep websites usually contain
only a few searchable forms, most of which lie within a depth of three, our crawler is divided into two
stages: site locating and in-site exploring. The site locating stage helps achieve wide coverage of sites for
a focused crawler, and the in-site exploring stage can efficiently perform searches for web forms within a
site.

1.7 ABSTRACT

As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently
locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature
of the deep web, achieving wide coverage and high efficiency is a challenging issue. We propose a two-stage
framework, namely Smart Crawler, for efficiently harvesting deep-web interfaces. In the first stage, Smart
Crawler performs site-based searching for center pages with the help of search engines, avoiding visits to
a large number of pages. To achieve more accurate results for a focused crawl, Smart Crawler ranks
websites to prioritize highly relevant ones for a given topic. In the second stage, Smart Crawler achieves
fast in-site searching by excavating the most relevant links with an adaptive link ranking. To eliminate bias
toward visiting some highly relevant links in hidden web directories, we design a link tree data structure to
achieve wider coverage of a website.

Our experimental results on a set of representative domains show the agility and accuracy of our
proposed crawler framework, which efficiently retrieves deep-web interfaces from large-scale sites and
achieves higher harvest rates than other crawlers.

1.8 GOALS AND OBJECTIVES

To design products that satisfy their target users, a deeper understanding is needed of user characteristics
and of the product properties connected to the unexpected problems users face. These user
characteristics encompass cognitive aspects, personality, demographics, and usage behaviour. The product
properties include operational transparency, interaction density, product importance, frequency of use
and so on. This study focuses on how user characteristics and product properties can influence whether
soft usability problems occur and, if so, which types. The study will lead to an interaction model that
provides an overview of the interaction between user characteristics, product properties, and soft usability
problems.




1) To record learned patterns of deep web sites and form paths for incremental crawling.

2) To rank site URLs to prioritize potentially deep sites for a given topic. To this end, two features, site
similarity and site frequency, are considered for ranking.

3) To build a focused crawler consisting of two stages: efficient site locating and balanced in-site exploring.
Smart Crawler performs site-based locating by reverse-searching the known deep web sites for center
pages, which can effectively find many data sources for sparse domains.

4) To employ an adaptive learning strategy that updates and leverages information collected
successfully during crawling.

1.9 Relevant mathematics associated with the Project

A] Identify the set of features

S = {f1, f2, f3...}

where S is the main set of features, comprising the subset f1, f2, f3, ...

B] Identify the set of form data called as documents

DC = { d1, d2, d3...}

where DC is the main set of documents, comprising the sub-forms d1, d2, d3, ...

C] Identify the set of root words

I = { i1, i2, i3 }

where I is the main set of root words i1, i2, i3.

D] Identify the set of stop words.

O = {o1, o2, o3...}


where O is the main set of stop words o1, o2, o3, ...

E] Identify the set of class detection.

CD = {cd1, cd2, cd3...}


where CD is the main set of class-detection outcomes cd1, cd2, cd3, ..., i.e. positive and negative words.

F] Identify the processes as P.

P = set of processes

P = {P1, P2, P3, P4, ...}

where P1, P2, P3, ... are the individual processes of the system.

J] Identify failure cases as FL

Failure occurs when

FL= { F1,F2,F3... }

1. F1: failure occurs if the data is not distributed in chunks.

K] Identify success case SS:-

Success is defined as -

SS= {S1,S2,S3,S4 }

1. S1 = Data collection module

2. S2 = data analysis using different algorithms

3. S3 = data classification

4. S4 = analysis and results


I] Initial conditions as I0

1. User wants to provide the online dataset.

2. User selects the features.

1.10 Names of Conferences / Journals where papers can be published

International Journal of Innovative Research in Computer and Communication Engineering
(A High Impact Factor, Monthly, Peer Reviewed Journal), Vol. 4, Issue 1, January 2016

1.10 REVIEW OF CONFERENCE/JOURNAL PAPERS SUPPORTING PROJECT IDEA

Due to the heavy usage of the internet, a large amount of diverse data is spread over it, and users need
access to particular data or to search for the most relevant data. It is very challenging for a search engine
to fetch relevant data as per the user's need, and doing so consumes considerable time. So, to reduce the
large amount of time spent searching for the most relevant data, we propose the advanced crawler. In the
proposed approach, results collected from different web search engines are combined to achieve a meta-search
approach. Multiple search engines are queried for the user query, their results are aggregated in one single
space, and then two-stage crawling is performed on that data, i.e. on the URLs. Site locating and in-site
exploring are carried out to reach the most relevant sites with the help of page ranking and reverse
searching techniques. The system works in both online and offline manner.

1.11 Plan of project execution


Figure 1.1: Plan of project execution



Chapter 2

TECHNICAL KEYWORDS

2.1 AREA OF PROJECT

The project area is based on web mining and text mining.

2.2 TECHNICAL KEYWORDS:

• Meta search,

• Two stage crawler,

• Page Ranking,

• Reverse searching

• http://ijircce.com/current-issue.html

Chapter 3

INTRODUCTION

3.1 PROJECT IDEA

In web search applications, queries are submitted to search engines to represent the information needs
of users. However, queries may not always represent users' specific information needs, since many
ambiguous queries cover a broad topic and different users may want information on different aspects
when they submit the same query.

User search goals are the information needs, on different aspects of a query, that groups of users want
to satisfy. User search goals can be considered as clusters of information needs for a query. The aim is to
discover the number of diverse user search goals for a query and to depict each goal with some keywords
automatically.

3.2 MOTIVATION

Owing to the following features, crawling the hidden web has important implications for collecting highly
distilled, relevant information for storage at the search engine site.

1. The first motivation is that the deep web is inaccessible to most search engine crawlers, which
means that many users are unaware of its rich content. It is difficult to index the deep web using tradi-
tional methods [7][8].

2. Web pages are updated very frequently. It is a general observation that larger pages change more
frequently and are more extensible than smaller pages.

3. The size of the web is so huge [3] that it cannot be measured accurately. The deep web contains a
very large and wide range of freely accessible databases that are not searched or indexed by search engines.


4. The deep web is self-organized in the sense that different web pages are dynamically generated
for different purposes. Anyone can post a web page on the Web. There are no standards and no policies
for structuring and formatting web pages. In addition, there is no standard editorial review process
to find errors, falsehoods and invalid statements. Due to the unconnected structure of the deep web,
new search techniques must be developed to retrieve the information [9][10].

3.3 LITERATURE SURVEY

Around 1993, ALIWEB grew up as the web-page equivalent of Archie and Veronica. Instead of
cataloguing files or text records, webmasters would submit a special standardized file with site
information [4]. The next development in cataloguing the web came later in 1993 with spiders. Like robots,
spiders scoured the web for web-page information. These early versions looked at the titles of the web
pages, the header information, and the URL as a source of keywords. The database techniques used
by these early search engines were primitive.

For example, a search would return hits (lists of links) in the order in which those hits appeared in the
database. Only one of these search engines made an effort to rank the hits according to the websites'
relationship to the keywords. The first popular search engine, Excite, has its roots in these early days of
web classifying. The Excite project was begun by a group of Stanford undergraduates and was released
for general use in 1994 [4]. In the last few years, many methodologies and techniques have been proposed
for searching the deep web, i.e. for finding the data hidden within the WWW; the design of meta search
engines and similar deep-web search sites are examples of such engines. The first meta search engine was
created during 1991-1994; it provided access to many search engines at a time through a single query given
as input, was named MetaCrawler, and was proposed at the University of Washington [3]. Work has
continued since then to surface data from the deep ocean of the internet; one of the best-known examples
is Guided Google, proposed by Choon Hoong Ding and Rajkumar Buyya, which uses the Google API for
searching and for guiding Google's search through built-in methods and a function library. In 2011, a web
service architecture for meta search engines was proposed by K. Shrinivas, P. V. S. Shrinivas and
A. Govardhan; according to their study, meta search engines can be classified into two types: general-purpose
search engines and special-purpose meta search engines. Earlier search engines focused on searching the
complete web, but year after year, to reduce complexity, the focus has shifted to searching for information
in a particular domain [2]. Information retrieval is the technique of searching and retrieving relevant
information from a database. The efficiency of searching is measured using precision and recall: precision
specifies how many of the retrieved documents are relevant, and recall specifies whether all of the relevant
documents have been retrieved (the standard formulas are given at the end of this section). Web searching
is also a type of information retrieval, because the user searches for information on the web.

The mining of information available on the web is known as web mining, which can be classified
into three types: web content mining, web structure mining and web usage mining. The part of the web
that can only be reached through complex queries, and which search engines still cannot index, is known
as the deep web. The deep web is the invisible web, consisting of publicly accessible pages whose information
resides in databases, such as catalogues and references, that are not indexed by search engines [2]. The
deep web is growing rapidly day by day, and locating its sources efficiently requires effective techniques to
achieve the best results. One system that implements this effectively is Smart Crawler, a two-stage crawler
for efficiently harvesting deep-web interfaces. By using some basic search-engine strategies, it achieves good
results in finding the most significant data; these techniques include reverse searching and incremental searching.
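
For reference, the precision and recall measures mentioned above are conventionally defined as follows (the standard information-retrieval formulation, stated here for completeness), where R_rel denotes the set of relevant documents and R_ret the set of retrieved documents:

\[
\mathrm{Precision} = \frac{|R_{\mathrm{rel}} \cap R_{\mathrm{ret}}|}{|R_{\mathrm{ret}}|},
\qquad
\mathrm{Recall} = \frac{|R_{\mathrm{rel}} \cap R_{\mathrm{ret}}|}{|R_{\mathrm{rel}}|}
\]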



Chapter 4

PROBLEM DEFINITION AND SCOPE

4.1 PROBLEM STATEMENT

We propose an effective deep-web harvesting framework, Smart Crawler, that achieves both wide coverage
and high efficiency for a focused crawler. Based on the observation that deep websites usually contain
only a few searchable forms, most of which lie within a depth of three, our crawler is divided into two
stages: site locating and in-site exploring. The site locating stage helps achieve wide coverage of sites for
a focused crawler, and the in-site exploring stage can efficiently perform searches for web forms within a
site.

4.1.1 Goals and objectives

To design products that satisfy their target users, a deeper understanding is needed of user characteristics
and of the product properties connected to the unexpected problems users face. These user
characteristics encompass cognitive aspects, personality, demographics, and usage behaviour. The product
properties include operational transparency, interaction density, product importance, frequency of use
and so on. This study focuses on how user characteristics and product properties can influence whether
soft usability problems occur and, if so, which types. The study will lead to an interaction model that
provides an overview of the interaction between user characteristics, product properties, and soft usability
problems.

1) To record learned patterns of deep web sites and form paths for incremental crawling.

2) To rank site URLs to prioritize potentially deep sites for a given topic. To this end, two features, site
similarity and site frequency, are considered for ranking.


3) To build a focused crawler consisting of two stages: efficient site locating and balanced in-site exploring.
Smart Crawler performs site-based locating by reverse-searching the known deep web sites for center
pages, which can effectively find many data sources for sparse domains.

4) To employ an adaptive learning strategy that updates and leverages information collected
successfully during crawling.

4.1.2 Statement of scope

Search Engine Marketing (SEM) was once used as an umbrella term to encompass both SEO (search engine
optimization) and paid search activities. Over time, the industry has adopted the SEM acronym to refer
solely to paid search.

Google AdWords is by many measures the most popular paid search platform used by search mar-
keters, followed by Bing Ads, which also serves a significant portion of ads on Yahoo. Beyond that, there
are a number of 2nd tier PPC platforms as well as PPC advertising options on the major social networks.

In addition to covering general paid search trends, you can find the most recent news about SEM
and helpful tips to get started with PPC ads on the major search marketing platforms below:

• Google AdWords

• Bing Ads

• Yahoo: Search Ads

Each platform offers its own getting-started guides and helpful tutorials. Another beginner
resource is Google's Insider's Guide to AdWords (PDF). Since the guide was last updated in 2008,
the Google AdWords UI (user interface) has changed, along with several features, but the guide
may still offer a useful introduction.

At Search Engine Land, we generally use SEM and/or Paid Search to refer to paid listings, with
the longer term of search marketing used to encompass both SEO and SEM. Below are some of
the most common terms also used to refer to SEM activities:

• Paid search ads

• Paid search advertising

• PPC (pay-per-click) *


• PPC (pay-per-call) some ads, particularly those served to mobile search users, may be charged by
the number of clicks that resulted in a direct call from a smartphone.

• CPC (cost-per-click) *

• CPM (cost-per-thousand impressions) *

• Most search ads are sold on a CPC / PPC basis, but some advertising options may also be sold on
a CPM basis.

4.2 MAJOR CONSTRAINTS

The system is implemented in the Java language and uses the HTTP/TCP/IP protocols. Java has had
a profound effect on the Internet because it expands the universe of objects that can move about freely
on the Internet. There are two types of objects transmitted over a network: passive information and
dynamic, active programs. Dynamic network programs present serious problems in the areas of security
and portability, since downloading a normal program risks viral infection. Java addresses these problems
through applets: using a Java-compatible web browser, we can download Java applets without fear of
viral infection.

4.3 METHODOLOGIES OF PROBLEM SOLVING AND EFFICIENCY ISSUES

Web crawlers are programs that traverse the Web and automatically download pages for search engines.
A conventional crawler mainly depends on the hyperlinks of the publicly indexable web to discover and
download pages. Due to the lack of links pointing to deep-web pages, current search engines are unable
to index the deep web. The major issues and challenges in designing an efficient deep web crawler to
crawl deep-web information are given as follows:

Identification of a proper entry point among the searchable forms is a major issue for crawling hidden
web content. If entry to the hidden web pages is restricted to querying a search form, two challenges
arise: understanding and modelling the query interface, and filling these query interfaces with meaningful
queries.

There is a big challenge in accessing relevant information through the hidden web: identifying a
relevant domain among the large number of domains present on the World Wide Web.


A crawler should not affect the normal functioning of a website during the crawling process by
overburdening it. A crawler with a high request rate may even result in a denial of service.

An efficient hidden web crawler [4] should automatically parse, process and interact with form-based
search interfaces. A general PIW crawler only submits a request for the URL to be crawled, whereas a
hidden web crawler also supplies input in the form of search queries in addition to the request for the URL.

Many websites present a query form to users for further access to their hidden database. A general
web crawler cannot access these hidden databases because it is unable to process the query forms.

Many websites are equipped with client-side scripting languages and session-maintenance mechanisms.
A general web crawler cannot access these types of pages because it is unable to interact with them.
A sketch of the form-identification step is given below.
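
As a concrete illustration of the form-identification step discussed above, the sketch below scans fetched HTML for a <form> element containing a free-text input, a rough indicator of a searchable interface. It is a minimal sketch using only the standard library; the class and method names are illustrative and not taken from the project code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper (hypothetical name): decides whether an HTML page
// appears to contain a searchable form, i.e. a <form> with a text input.
public class SearchableFormDetector {

    // Matches a whole <form> ... </form> block, case-insensitively.
    private static final Pattern FORM =
            Pattern.compile("(?is)<form\\b.*?</form>");

    // Matches a text-style input or a textarea inside a form body.
    private static final Pattern TEXT_INPUT =
            Pattern.compile("(?is)<input\\b[^>]*type\\s*=\\s*[\"']?(text|search)[\"']?|<textarea\\b");

    public static boolean hasSearchableForm(String html) {
        Matcher form = FORM.matcher(html);
        while (form.find()) {
            if (TEXT_INPUT.matcher(form.group()).find()) {
                return true;   // at least one form accepts free-text queries
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String page = "<html><body><form action=\"/search\">"
                + "<input type=\"text\" name=\"q\"/></form></body></html>";
        System.out.println(hasSearchableForm(page));   // prints: true
    }
}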

4.4 OUTCOME

We filtered the data obtained from web pages on servers to get text files as needed by the semantic
search engine. We could also filter out unnecessary URLs before fetching data from the server. Experimental
results show that the approach is efficient both for privatized and for general search related to deep-web
information hidden behind HTML forms. Since the crawler employs appropriate domain-specific
keywords to crawl the information hidden behind query interfaces, the quality of the information is better,
as established by the experimental results.

4.5 APPLICATIONS

1 RBSE was the first published web crawler. It was based on two programs: the first program, ”spider”
maintains a queue in a relational database, and the second program ”mite”, is a modified www browser
that downloads the pages from the Web.

2 Web Fountain is a distributed, modular crawler similar to Mercator but written in C++. It features
a ”controller” machine that coordinates a series of ”ant” machines. After repeatedly downloading pages,
a change rate is inferred for each page and a non-linear programming method must be used to solve the
equation system for maximizing freshness.

3 WebRACE is a crawling and caching module implemented in Java, and used as a part of a more
generic system called eRACE. The system receives requests from users for downloading web pages, so


the crawler acts in part as a smart proxy server. The system also handles requests for ”subscriptions” to
Web pages that must be monitored: when the pages change, they must be downloaded by the crawler
and the subscriber must be notified. The most outstanding feature of WebRACE is that, while most
crawlers start with a set of ”seed” URLs, WebRACE is continuously receiving new starting URLs to
crawl from.

4.6 HARDWARE RESOURCES REQUIRED

• Processor:- Intel Pentium 4 or above

• Memory:- 512 MB or above

• Other peripheral:- Printer

• Hard Disk:- 10 GB

4.7 SOFTWARE RESOURCES REQUIRED

Technologies and tools used in the project are as follows:

Technology used:

Front End

• Jdk 1.6.0

• Netbeans 6.9.1

• Internet Explorer 6.0/above

Back-End

• Mysql 5.1



Chapter 5

PROJECT PLAN

5.1 PROJECT ESTIMATES

5.1.1 RECONCILED ESTIMATES

The completed project may differ significantly from the planned tasks and the projected conditions upon
which the initial estimate was based. The initial estimate must be adjusted to account for these differences
if a meaningful comparison between the estimate and the actual project effort is to be established.
The purpose of the Reconciliation Advisor is to recalculate the estimated effort using the actual statistics
and results from the completed project. The Reconciliation Advisor gathers actual project data through
a question-and-answer process similar to that used when the system requirements were gathered for the
initial estimate. The questions differ only in that the past tense is used: "Did the application replace a
mission-critical or line-of-business process?" instead of "Does the application replace a mission-critical or
line-of-business process?".

COST ESTIMATE

A cost estimate is an approximation of the cost of a program, project, or operation. The cost estimate
is the product of the cost-estimating process. It has a single total value and may have identifiable
component values. This project has a low development cost because it uses an open-source platform.

Time Estimates

1. Understand the Project Outcome: the project outcome must be provided within time, by May.


2. Estimate Time: the estimated project duration is almost one year.

3. Plan for it Going Wrong: the project testing process is scheduled for March.

Financial Budget

Function Point Analysis

The goal of Function Point Analysis is to evaluate a system's capabilities from a user's point of view. To achieve
this goal, the analysis is based upon the various ways users interact with computerized systems. From a
user's perspective, a system assists them in doing their job by providing five basic functions.

Estimation Units - Project Effort

The total estimate is calculated in person months, which can easily be converted to other units of
effort using the following conversion factors:

1 Person Days/Person Month


Effort levels are based on a relative distribution factor: they determine how much of the task-group
estimate is distributed to each individual task. The formula to calculate the hours estimate for each task
in a task group, based on the effort level of each task, is: Task Hours Estimate =
Employee Hours Estimate × (Task Effort Level / Employee Effort)
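
For illustration, a small worked example with made-up numbers (not actual project figures): if the employee hours estimate is 160 hours, the task effort level is 2 and the total employee effort is 8, then

\[
\text{Task Hours Estimate} = 160 \times \frac{2}{8} = 40 \text{ hours.}
\]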

5.2 Project Resources

H/W System Configuration :-

Processor - Pentium III

Speed - 1.1 GHz

RAM - 256 MB (min)

Hard Disk - 20 GB

Floppy Drive - 1.44 MB


Key Board - Standard Windows Keyboard

Mouse - Two or Three Button Mouse

Monitor - SVGA

S/W System Configuration:-


Operating System : Windows XP / 7

Front End: Java, RMI, Swing.

5.3 RISK MANAGEMENT W.R.T. NP-HARD ANALYSIS

Project planning and management activities deal with the planning of the project. The tasks are divided
amongst the project members, and these tasks have to be executed or completed in parallel to reduce
the time required. The project plan also gives an overview of what should be done at a particular time.

5.3.1 Schedule Risk

The project schedule slips when project tasks and schedule-release risks are not addressed properly. Schedule
risks mainly affect the project and may lead to project failure. Schedules often slip due to the following
reasons:

• Wrong time estimation

• Resources are not tracked properly.

• Failure to identify complex functionalities and time required to develop those functionalities.

• Unexpected project scope expansions

5.3.2 Budget Risk

• Wrong budget estimation.

• Cost overruns

• Project scope expansion


5.3.3 Operational Risk

Causes of Operational risks:

• Failure to address priority conflicts

• Failure to resolve the responsibilities

• Insufficient resources

• No proper subject training

• No resource planning

• No communication in team.

5.3.4 Technical Risk

Technical risks generally lead to failure of functionality and performance.

Causes of technical risks are:

• Continuous changing requirements

• Product is complex to implement.

• Difficult project modules integration.

5.3.5 Project Risk

Initially applications have similar threats, vulnerabilities and risks to those posed by typical web and
client/server applications. That said, because users have the power and ability to download whatever
they wish and manage their devices to their liking, we need to think about these top five risks and how
to mitigate them:

Inherent, Blind Trust

App stores come pre-installed on our mobile devices and provide access to a ton of mobile applications.
We blindly trust that the app stores have performed due diligence on the apps in their stores. Yet, in
reality, app store vendors lack the cycles to ensure that the apps they make available won’t open up our
employees/users to risks that can harm the business.
Functional Risks


Opening, editing, sending, receiving and e-mailing documents; syncing backups; checking in to my
current location; etc. - these are a tiny subset of tasks that I can complete with my devices. But what
happens if I open a PDF from my business e-mail into a PDF viewer that I downloaded? Suppose I then
sync that document to the PDF viewer? At this point, my potentially sensitive document is being man-
aged by someone else’s application (probably insecure application and sync storage), and it is completely
outside of my control. How about if I check in to my current location via Facebook or Foursquare? Due
to the sensitive nature of what I do, some of my clients don’t want others to know I am working for
them. But if I ”check in,” the whole world (literally) becomes aware of where I am.

Malware
Malware has forever been a problem in the IT world, and it is no different in the mobile sphere.
Malware can wreak havoc by stealing sensitive data, monitoring traffic, connecting to internal networks
and infecting internal machines. And that’s just for starters. Malware will continue to evolve in apps
from app stores, and attackers will continue to refine their approaches to malfeasance.
Root Applications
Rooting and jail breaking are commonplace. Users or attackers run exploits against the mobile oper-
ating system to provide them with unfettered access to the file system and allow them to be the ”root”
user of the operating system. Some users appreciate the freedoms that having root access gives them.
Root access also provides a gateway to other app stores, such as Cydia, or the ability to download ap-
plications from untrusted sources. The applications running as root deliver functional and malware risks
to the business. In some cases, the functional/malware line starts to get fuzzy with the root applications
because, typically, the applications provide more functionality than the typical non-root applications
provide.

Inappropriate Applications

Clearly, not all applications are appropriate in the workplace, and I’ll leave it to your imagination
to classify which ones would be classified as Not Safe For Work. The number of mobile applications has
gone from zero to 1.5 million in a little more than four years, and it will continue to grow in quantum
leaps. As the mobile app world continues to evolve, so will the risks. In next month’s posting, I will
discuss how to address each of these risks and provide specifics on how to thwart them.

5.4 PROJECT SCHEDULE

The project schedule starts in August 2015 and ends in March 2016.

The Different Phases identified are:


1. Requirement Analysis.

2. Requirement Specification.

3. System Design.

4. Detailed Design.

5. Coding.

6. Testing.

The Gantt chart as shown below represents the approximate schedule followed for the completion of
each phase:

5.5 TEAM ORGANIZATION

Teams work in an organization to improve quality, complete projects and change processes. In this
project, the team consists of three members.

5.5.1 Management reporting and communication



Chapter 6

SOFTWARE REQUIREMENT
SPECIFICATION

6.1 INTRODUCTION

6.1.1 Purpose and Scope of Document

The software requirement specification (SRS) is a communication tool between stakeholders and software
designers.

Purposes of SRS are:

Facilitating reviews

Describing the scope of work

Providing a framework for testing primary and secondary use cases

Providing platform for ongoing refinement

6.1.2 Overview of responsibilities of Developer

Responsibilities of Developer:

1. Provide better project design and analysis

2. Provide error free program


3. Provide better time complexity

4. Provide user friendly project

5. Provide accuracy of project

6. Provide easy handling of project

7. Provide security for user privacy

6.2 USAGE SCENARIO

Manual and automated tests are the two types of software testing. We perform manual testing of our
system, i.e. without using any automated tool or script. In this type of testing, the tester takes on the role
of an end user and tests the software to identify any unexpected behaviour or bug. There are different stages
of manual testing, such as unit testing, integration testing, system testing and user acceptance testing. Testers
use a test plan, test cases or test scenarios to ensure the completeness of testing. Manual testing also
includes exploratory testing, in which testers explore the software to identify the errors in it. Automation
testing, also known as test automation, is when the tester writes scripts and uses software to test the
software; this process involves automating a manual process. Automation testing is used to re-run, quickly
and repeatedly, the test scenarios that were performed manually.

6.2.1 User profiles

Actors may represent roles played by human users, external hardware or other subjects.

There are two types of actors:

1 User

2 Admin/server


6.2.2 Use Case

Figure 6.1: Use case


Figure 6.2: Use case

6.3 Functional Model and Description

6.3.1 Data Flow Diagram

Level 0 Data Flow Diagram

Level 1 Data Flow Diagram


Figure 6.3: Data Flow Diagram 0

Figure 6.4: DATA FLOW DIAGRAM 1

6.3.2 Activity Diagram:


Figure 6.5: User Activity Diagram


Figure 6.6: Admin Activity Diagram

6.3.3 Non Functional Requirements:

• Accessibility

• Capacity, current and forecast

• Compliance

• Documentation

• Disaster recovery

• Efficiency

• Effectiveness


• Extensibility

• Fault tolerance

• Interoperability

• Maintainability

• Privacy

• Portability

• Quality

• Reliability

• Resilience

• Response time

• Robustness

• Scalability

• Security

• Stability

• Supportability

• Testability

6.3.4 State Diagram

A state diagram shows a state machine, consisting of states, transitions, events, and activities. State
diagrams are used to illustrate the dynamic view of a system. They are especially important in modelling the
behaviour of an interface, class, or collaboration. State diagrams emphasize the event-ordered behaviour
of an object, which is especially useful in modelling reactive systems.


Figure 6.7: State Diagram

6.3.5 Design Constraints

Before drafting your design considerations, you must know the desired outcomes of your design opportunity.
Some design opportunities aim at improving certain functions or safety; some are targeted at a specific
audience; some aim to solve a nagging problem that currently seems to have no viable solution; and some
are proposed to make the product enjoyable to use.

Knowing precisely what you want out of your design proposal helps a great deal in drafting a good set
of design considerations, because it lets you be precise in identifying the areas of consideration. Otherwise
you end up simply stating obvious universal requirements, such as that products should be safe for users,
must look good, must be colourful, and so on.


6.3.6 Software Interface Description

Technologies and tools used in the project are as follows:

Technology used:
Front End

• Jdk 1.6.0

• Netbeans 6.9.1

• Eclipse

• Internet Explorer 6.0/above

Back-End

• Mysql 5.1

Memory constraints
Basically, the supporting software will use around 20 GB on the hard drive, and the application itself will
be around 300 MB of data. When we deploy the application on a web server, we assume 1 GB of space
for the website and 500 MB for the database.

The key considerations of Java are


1. Object Oriented: Java follows the "everything is an object" paradigm. The Java object model is simple
and easy to extend.

2. Multithreaded: Java supports multithreaded programming, which allows you to write programs
that do many things simultaneously.

3. Architecture-Neutral: the main problem facing programmers is that there is no guarantee that a program
written today will run tomorrow, even on the same machine. The Java language and the JVM solve this
problem; their goal is "write once, run anywhere, any time, forever."

4. Distributed: Java is designed for the distributed environment of the Internet because it handles
TCP/IP protocols.

The minimum requirements the client should have to establish a connection to a server are as follows:
Processor: Pentium III

RAM: 128 MB


Hard Disk: 2 GB

Web server: Apache Tomcat 1.7 Java Web Server

Protocols: TCP/IP



Chapter 7

DETAILED DESIGN DOCUMENT USING APPENDIX A AND B

7.1 INTRODUCTION

The page fetcher fetches pages from the HTTP server and sends them to the page analyzer, which checks
whether each is the required, appropriate page based on the search topic and the kind of form the page
contains.

After filling the form, the form submitter sends the request to the HTTP server again.

The crawl frontier contains all the links that are yet to be fetched from the HTTP server, i.e. the links
obtained after the URL filter. It takes a seed URL to start the procedure, processes that page, retrieves
all the links and forms, and adds them to the list of URLs, rearranging the list as needed. This list of
URLs is called the crawl frontier (an illustrative sketch is given at the end of this section).

The link extractor extracts the links (hyperlinks) from the text file for further retrieval from the
HTTP server. The extraction of links is done according to the links identified by the page analyzer/parser
as likely to lead, in one or more steps, to pages that contain searchable form interfaces. This dramatically
reduces the quantity of pages the crawler has to crawl in the deep web. Fewer pages need to be crawled,
since focused crawling is applied together with checking the relevancy of the obtained results to the topic,
which results in a limited extraction of relevant links.

This module is responsible for providing the best-matching query-word assignment to the form-filling
process and for reducing the overload caused by queries that are less relevant to the domain. The links
must be ranked so that more information is gathered from each link; this is based on a link-ranking algorithm.

This module plays an important role in indexing the generated keywords into the content database.
The indexer collects, parses, and stores data to facilitate fast and accurate information retrieval. It maps
keywords to URLs for fast access and retrieval.

The content database stores all the generated links and keywords. When the user puts a query into the
user interface, the index is matched with the corresponding links and the information is displayed to the
user for further processing.

This module works when the information in a site is hidden behind an authentication form. It stores
the authentication credentials of every domain, provided by the individual user, in a knowledge base
situated at the crawler. At crawl time, it automatically authenticates itself on the domain to crawl the
hidden web contents. The crawler extracts and stores keywords from the contents and makes them
available to the privatized search service, which maintains privacy. It provides the search interface
through which the user places queries; this involves searching the keywords and other information
stored in the content database, which is populated after the whole process of authenticated crawling.

This module takes the reference from the Query word to URL Manager for the purpose of form
submission.

Interface generator is used to give the view of the contents stored in the content database after the
search is completed. For example, the interface generator shows the list of relevant links indexed and
ranked by link ranker module and link indexer module respectively. Link event analyzer analyzes the link
which is activated by the user so that it could forward the request to display the page on the requested
URL.

It is responsible for mapping the login credentials provided by the user to the crawler with the
information provider domain. The main benefit of using this mapper is to overcome the hindrance of
information retrieval between result link and information. The crawler uses the mapping information to
allow the specific person to receive information contents directly from the domain by automatic login
procedure and eliminates the step of separate login for user.
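
The sketch below illustrates, under simplifying assumptions, how the crawl frontier and link ranking described above might be organized: a priority queue of URLs ordered by a relevance score, with a visited set to avoid refetching. The class names and the scores are illustrative placeholders, not the project's actual implementation.

import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Minimal sketch of a crawl frontier (hypothetical names): URLs are ranked
// by a relevance score and fetched highest-score-first; a visited set
// prevents the same URL from being crawled twice.
public class CrawlFrontier {

    // A URL together with the score assigned by the site/link ranker.
    static class RankedUrl implements Comparable<RankedUrl> {
        final String url;
        final double score;

        RankedUrl(String url, double score) {
            this.url = url;
            this.score = score;
        }

        // Higher score = higher priority (PriorityQueue pops the smallest element).
        public int compareTo(RankedUrl other) {
            return Double.compare(other.score, this.score);
        }
    }

    private final PriorityQueue<RankedUrl> queue = new PriorityQueue<RankedUrl>();
    private final Set<String> visited = new HashSet<String>();

    // Seed URLs enter with the highest possible priority.
    public void addSeed(String url) {
        add(url, Double.MAX_VALUE);
    }

    public void add(String url, double score) {
        if (!visited.contains(url)) {
            queue.add(new RankedUrl(url, score));
        }
    }

    // Returns the next URL to fetch, or null when the frontier is empty.
    public String next() {
        RankedUrl r;
        while ((r = queue.poll()) != null) {
            if (visited.add(r.url)) {
                return r.url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier();
        frontier.addSeed("http://example.com/");
        frontier.add("http://example.com/search", 0.9); // score from the link ranker
        frontier.add("http://example.com/about", 0.1);
        String url;
        while ((url = frontier.next()) != null) {
            System.out.println("fetch: " + url);
        }
    }
}

A production frontier would additionally enforce per-site politeness delays and persist its state between runs, but that is outside the scope of this sketch.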

7.2 ARCHITECTURAL DESIGN


Figure 7.1: Architectural Design Diagram

7.3 COMPONENT DESIGN

7.3.1 Class Diagram


Figure 7.2: Class Diagram



Chapter 8

PROJECT IMPLEMENTATION

8.1 INTRODUCTION

In the above chapters, it was identified that the following issues need to be addressed when designing
an effective deep web crawler to extract deep-web information contents.

1. Query words collected in the knowledge base must be gathered in an optimized, effective and
interoperable manner.

2. Effective context identification for recognizing the type of domain.

3. For the purpose of extracting the deep web information, a crawler should not overload any of the
web server.

4. Overall extraction mechanism of deep web crawling must be compatible, flexible, inter-operative
and configurable with reference to existing web server architecture.

5. There must be a procedure to define the number of query words to be used to extract the deep web
data with element value assignments. For example date wise fetching mechanism can be implemented
for limiting the number of query words.

6. Architecture of deep web crawler must be based on global level value assignment,

so that inter-operative element value assignment can be used with it. However, for element value
assignment, surface search engine and clustering of query word can be done based on ontology as well as
quantity of keywords found on surface search engine.

7. For sorting and ranking, the process of query word information extracted from deep web must


be analyzed and resolved through two major activities i.e. exact context identification and some rank-
ing procedure so that enrichment of collection in knowledge base at the search engine could be carried out.

8. The deep web crawler must be robust against unfavorable conditions. For example, if a resource is
removed after the crawling process, it is impossible to reproduce the exact result pages of the specific query
words, even if the existing record of the relation between URL and query word is available.

8.2 TOOLS AND TECHNOLOGIES USED

In today's scenario, the internet is changing from a medium of presentation to one of social connectivity.
Website analytics is the process of measurement, collection, analysis and reporting of internet data for the
purposes of understanding and optimizing web use. Website analytics is not only used for estimating website
traffic but can also be used for commercial applications. The results of website analytics can
also be very helpful for companies advertising their products. Website analytics collects and analyzes
information about the number of visitors, page views, etc. of websites, which can be further utilized
for various purposes.

Web analytics tools have conventionally been used to help website owners learn more about clients'
online behaviour in order to improve website structural design and online marketing actions. Most of
today's web analytics solutions offer a range of analytical and statistical data, ranging from fundamental
traffic reporting to individual-level data that can be linked with price, response and profit data. Understanding
web navigational data is a crucial task for web analysts, as it can influence the website upgrading
process. There are two broad categories of website analytics, off-site and on-site.
Off-site website analytics refers to web measurement and analysis of a website irrespective of whether one
owns or maintains it. It includes the measurement of a website's prospective audience (opportunity),
share of voice (visibility), and buzz (comments) happening on the internet as a whole.
On-site website analytics measures a visitor's journey once on the website and
requires the use of drivers and conversions. One illustration of this process is determining which
landing pages encourage subscribers to purchase.

8.3 METHODOLOGIES/ALGORITHM DETAILS

The algorithms used by the crawler are given below.

Stop Word Removal:


Frequently occurring words, such as pronouns, prepositions and conjunctions in English (e.g. "it", "in",
"and"), are known as stop words. These words have a very low discriminative value in text documents.
The step involves creating a list of stop words and then scanning the tokens to remove any stop words
found.

Algorithm 1: Stop Word Removal Approach

Input: stop-word list L[], string data D from which stop words are to be removed.

Output: verified data D with all stop words removed.

Step 1: Tokenize D into the token array S[].

Step 2: Initialize i = 0, k = 0.

Step 3: For each token S[k], if S[k] equals some stop word L[i], remove S[k].

Step 4: Join the remaining tokens of S back into D.

Step 5: End procedure.
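
A minimal Java sketch of Algorithm 1, assuming the stop-word list is supplied by the caller (the example words are illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of Algorithm 1: tokenize the input, drop tokens found in
// the stop-word list, and join the remaining tokens back together.
public class StopWordRemover {

    public static String removeStopWords(String data, Set<String> stopWords) {
        String[] tokens = data.toLowerCase().split("\\s+"); // Step 1: token array S[]
        StringBuilder result = new StringBuilder();
        for (String token : tokens) {                       // Step 3: scan the tokens
            if (!stopWords.contains(token)) {               // keep only non-stop words
                if (result.length() > 0) {
                    result.append(' ');
                }
                result.append(token);
            }
        }
        return result.toString();                           // Step 4: verified data D
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(
                Arrays.asList("it", "in", "and", "the", "of"));
        System.out.println(removeStopWords("harvesting of the deep web interfaces", stopWords));
        // prints: harvesting deep web interfaces
    }
}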

Stemming:
Stemming is the process of finding the root word of a token. For example, the words purification, purity,
purify and purifying have the stemmed root "pure". Stemming helps to reduce the dimensionality
of the feature space. The Porter stemming algorithm, a natural language processing (NLP) technique
that removes suffixes, is used to narrow down the size of the feature space.

Algorithm 2: Stemming Approach


Input: word w

Output: w with suffixes, including past-participle endings, removed.


Step 1: Initialize w.

Step 2: Initialize the steps of the Porter stemmer.

Step 3: If the last character of w is "e", remove it from w.

Step 4: If w ends with "ed", remove the "ed" suffix from w.

Step 5: Let k = w.length(). If the trailing characters of w equal "tion", replace that suffix with "te".

Step 6: End procedure.
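
A simplified Java sketch of the suffix-stripping rules listed in Algorithm 2 (trailing "e", "ed" and "tion" handling). This is only a toy approximation of the full Porter stemmer, and the class and method names are illustrative:

// Toy suffix stripper following the rules of Algorithm 2; a real system
// would use a complete Porter stemmer implementation instead.
public class SimpleStemmer {

    public static String stem(String w) {
        // Step 5 rule: a word ending in "tion" has that suffix replaced by "te".
        if (w.endsWith("tion") && w.length() > 4) {
            return w.substring(0, w.length() - 4) + "te";
        }
        // Step 4 rule: strip a trailing past-participle "ed".
        if (w.endsWith("ed") && w.length() > 3) {
            return w.substring(0, w.length() - 2);
        }
        // Step 3 rule: strip a single trailing "e".
        if (w.endsWith("e") && w.length() > 2) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("purification")); // purificate
        System.out.println(stem("crawled"));      // crawl
        System.out.println(stem("purity"));       // purity (no rule applies)
    }
}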

Algorithm 3: K-Means Clustering Algorithm


Step 1: Read the web history from dataset D.

Step 2: D holds a set of comments c.

Step 3: For each comment c in D, until null is reached:

Step 4: Set the outbound value for every cluster.

Step 5: If c is not null, check whether it is related to a relevant cluster among the existing clusters;
otherwise, create a new cluster with a new id.


Step 6: If the current comment is not related to any existing cluster, add it to the default cluster.

Step 7: Repeat from Step 3.

Step 8: End for.

Step 9: Classify all clusters.

Step 10: Set the proper category for each cluster.

Step 11: Predict the relevant category for the current query.

Procedure

Step 1: Choose the number of clusters.

Step 2: Randomly assign to every point coefficients for belonging to the clusters.

Step 3: Repeat until the algorithm has converged (that is, the change in the coefficients between two
iterations is no more than the given sensitivity threshold):

Step 4: Compute the centroid for every cluster, using the centroid formula.

Step 5: For every point, recompute its coefficients of belonging to the clusters, using the membership
formula.
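
For comparison, the sketch below shows the standard (hard) k-means loop that the procedure above approximates, run on one-dimensional feature values for brevity; the project's actual clustering operates on comment and query features, and the names here are illustrative.

import java.util.Arrays;
import java.util.Random;

// Compact sketch of the standard k-means loop on 1-D feature values.
public class KMeansSketch {

    public static double[] cluster(double[] points, int k, int maxIter) {
        Random rnd = new Random(42);
        double[] centroids = new double[k];
        for (int i = 0; i < k; i++) {                       // initial centroids from the data
            centroids[i] = points[rnd.nextInt(points.length)];
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {        // repeat until converged
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {       // assign each point to its nearest centroid
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(points[p] - centroids[c]) < Math.abs(points[p] - centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            for (int c = 0; c < k; c++) {                   // recompute each centroid as the mean of its points
                double sum = 0;
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        sum += points[p];
                        count++;
                    }
                }
                if (count > 0) {
                    centroids[c] = sum / count;
                }
            }
            if (!changed) {
                break;                                      // assignments stable: converged
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] features = {0.1, 0.15, 0.2, 0.8, 0.85, 0.9};
        System.out.println(Arrays.toString(cluster(features, 2, 100)));
        // prints the two cluster centres found
    }
}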

8.4 VERIFICATION AND VALIDATION FOR ACCEPTANCE

Verification is required because it answers the question: given the input supplied by the user, is the system
producing the result correctly?

Validation is required because it answers the question: is the given input valid, and does it produce the right result?

The system validates the input or query given by the user, whether it is a keyword or a phrase, as sketched below.
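
A hypothetical sketch of this validation step is shown below; the class name and the rule used to distinguish a keyword from a phrase (presence of whitespace) are assumptions made for illustration.

// A hypothetical sketch of query validation: the query must be non-empty, and it is
// treated as a phrase if it contains whitespace, otherwise as a single keyword.
public class QueryValidator {

    public static String classify(String query) {
        if (query == null || query.trim().isEmpty()) {
            throw new IllegalArgumentException("Query must not be empty");
        }
        return query.trim().contains(" ") ? "phrase" : "keyword";
    }

    public static void main(String[] args) {
        System.out.println(classify("deep web"));   // phrase
        System.out.println(classify("crawler"));    // keyword
    }
}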

Chapter 9

SOFTWARE TESTING

9.1 TYPE OF TESTING USED

1. Manual Testing:
Manual and automated testing are the two types of software testing. We use manual testing for our system, i.e. testing without any automated tool or script. In this type of testing the tester takes on the role of an end user and tests the software to identify any unexpected behavior or bug. There are different stages of manual testing, such as unit testing, integration testing, system testing and user acceptance testing.

Testers use a test plan, test cases or test scenarios to test the software and to ensure the completeness of testing. Manual testing also includes exploratory testing, where testers explore the software to identify errors in it.

2. Automated Testing:
Automation testing, also known as test automation, is when the tester writes scripts and uses software tools to test the software. This process automates a manual process. Automation testing is used to re-run, quickly and repeatedly, the test scenarios that were previously performed manually.
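
As an illustration of how such manual scenarios could also be automated, the following JUnit 4 sketch tests a small stand-in for the stop word removal step; the helper method, the expected strings and the assumption that JUnit 4 is on the classpath are all made for this example only.

// A minimal JUnit 4 sketch of automated unit tests for a preprocessing step.
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class StopWordRemovalTest {

    // Hypothetical stand-in for the stop word removal logic under test.
    static String removeStopWords(String input) {
        return input.replaceAll("\\b(it|in|and|the)\\b\\s*", "").trim();
    }

    @Test
    public void removesCommonStopWords() {
        assertEquals("deep web crawler", removeStopWords("the deep web and the crawler"));
    }

    @Test
    public void leavesContentWordsUntouched() {
        assertEquals("smart crawler", removeStopWords("smart crawler"));
    }
}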

9.2 TEST CASES AND TEST RESULTS

Login:

Indexing:

Se:

Login Test:

Registration Page:

Chapter 10

RESULTS

10.1 INPUT SCREEN SHOTS

10.2 OUTPUTS

1. Query words collected in the knowledge base must be handled in an optimized, effective and interoperable manner.

2. Effective context identification for recognizing the type of domain.

3. For the purpose of extracting deep web information, a crawler should not overload any web server.

4. The overall extraction mechanism of deep web crawling must be compatible, flexible, interoperable and configurable with reference to the existing web server architecture.

5. There must be a procedure to define the number of query words to be used to extract the deep web data with element value assignments. For example, a date-wise fetching mechanism can be implemented to limit the number of query words.

6. The architecture of the deep web crawler must be based on global-level value assignment, so that interoperable element value assignment can be used with it. For element value assignment, the surface search engine can be used, and clustering of query words can be done based on ontology as well as on the quantity of keywords found on the surface search engine.

7. For sorting and ranking, the query word information extracted from the deep web must be analyzed and resolved through two major activities, i.e. exact context identification and a ranking procedure, so that the collection in the knowledge base at the search engine can be enriched.

8. The deep web crawler must be robust against unfavorable conditions. For example, if a resource is removed after the crawling process, it is impossible to obtain the exact result pages of the specific query words, even if an existing record of the relation between URL and query word is available.

9. The crawler must have functionality to enable quick detection of failure, error or unavailability of any resource, as sketched below.
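
A possible sketch of such a check in Java is given below; the HEAD request, the three-second timeouts and the class name are illustrative assumptions rather than the project's actual implementation.

// A hedged sketch of a quick availability check for a crawled resource, using a HEAD
// request with short timeouts so that failures and unavailable resources are detected fast.
import java.net.HttpURLConnection;
import java.net.URL;

public class ResourceChecker {

    public static boolean isAvailable(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(3000);   // fail fast if the host does not answer
            conn.setReadTimeout(3000);
            int code = conn.getResponseCode();
            return code >= 200 && code < 400;
        } catch (Exception e) {
            return false;                   // treat errors and timeouts as unavailable
        }
    }

    public static void main(String[] args) {
        System.out.println(isAvailable("http://www.completeplanet.com/"));
    }
}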


Chapter 11

DEPLOYMENT AND MAINTENANCE

11.1 INSTALLATION AND UN-INSTALLATION

Steps to install JDK 7 on Windows 8


1. Find out whether Windows 8 is 32-bit or 64-bit.
2. Download the correct JDK 7 installer from the Java download site.
Go to the Java SE download site http://www.oracle.com/technetwork/java/javase/downloads/index.html
and select Java Platform (JDK) 7u13, which was the latest Java SE release at the time.

Install the JDK by double-clicking the Windows installer.


Once you have downloaded the correct JDK installer, the rest of the installation is like installing any other Windows application: just follow the instructions given by the Java SE Installation Wizard. Conveniently, JavaFX is now included as part of JDK 7, so you do not need to install JavaFX separately.

Include the JDK bin directory in the Windows 8 PATH environment variable.

Installation of MySQL Server

Unzip the setup file and execute the downloaded MSI file. Follow the instructions below exactly when
installing MySQL Server:

If you checked the "Configure the MySQL Server now" check box on the final dialog of the MySQL Server installation, the MySQL Server Instance Configuration Wizard will start automatically. Follow the instructions below carefully to configure your MySQL Server to run correctly with Event Sentry.

Select the drive where the database files will be stored; choose the fastest drive(s) available on your server.

It is recommended that you leave the default port 3306 in place; however, Event Sentry will also work with non-standard ports if necessary.
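
As an illustration, the following Java sketch connects to the MySQL server configured above through JDBC on the default port 3306; the database name, user and password are placeholders, and the MySQL Connector/J driver is assumed to be on the classpath.

// A small JDBC connectivity check against the locally installed MySQL server.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlConnectionCheck {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/crawlerdb";  // default port 3306
        try (Connection con = DriverManager.getConnection(url, "root", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT 1")) {
            if (rs.next()) {
                System.out.println("MySQL connection OK");
            }
        }
    }
}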

Install Eclipse on Windows. Download the Eclipse software from the Eclipse downloads page; you will see two versions, Eclipse Standard and Eclipse IDE for Java EE Developers.

Extract the downloaded Eclipse zip folder to install Eclipse. There is no setup file to run; all we have to do is extract the zip folder and configure the eclipse.ini file to start using it. You can extract the Eclipse folder to any location you want. Once you have extracted it, open the Eclipse folder and open the eclipse.ini file inside it.

Open Eclipse, set the workspace where you want to store the program files, set the JDK path, and create a project.

11.2 USER HELP

Chapter 12

CONCLUSION AND FUTURE SCOPE

CONCLUSION
In this work we have surveyed different kinds of general searching techniques and meta search engine strategies, and based on this survey we have proposed an effective way of searching for the most relevant data from the hidden web. We combine multiple search engines with a two-stage crawler to harvest the most relevant sites. By applying page ranking to the collected sites and focusing on a topic, the advanced crawler achieves more accurate results. The two-stage crawling performs site locating and in-site exploration on the sites collected by the meta crawler.

The deep web holds a very large volume of quality information compared to the surface web. Extraction of deep web information can be highly fruitful for both general and specialist users. Traditional web crawlers have limitations in crawling deep web information, and although some web crawlers are specially designed for this purpose, a very large amount of deep web information remains unexplored due to inefficient crawling of the deep web.

In this work, a literature survey and analysis of some of the important deep web crawlers was carried out to find their advantages and limitations. A comparative analysis of existing deep web crawlers was also carried out on the basis of various parameters, and it was concluded that a new deep web crawler architecture was required for efficient searching of deep web information, one that minimizes the limitations of the existing deep web crawlers while incorporating their strengths.

To resolve this issue, in the proposed architecture the query word selection takes place according to context identification as well as verification of the web pages crawled by the crawler. This feature also facilitates the identification of new query words.

Accessing relevant information through the hidden web is a big challenge, because a relevant domain must be identified among the large number of domains present on the World Wide Web.

For resolving this issue, similar domains are identified according to context and query interfaces and categorized at the element level, so that a large number of deep web domains can be crawled with the proposed architecture.

A crawler should not affect the normal functioning of a website during the crawling process by overburdening it. A crawler with a very high throughput rate may sometimes even result in denial of service.

The proposed deep web crawler uses a server to fetch query words, which are sorted according to domain context. Only the high-ranking query words are used to extract the deep web data, which reduces the overall burden on the server.

An efficient hidden web crawler should automatically parse, process and interact with form-based search interfaces. A general PIW (publicly indexable web) crawler only submits a request for the URL to be crawled, whereas a hidden web crawler also supplies input in the form of search queries in addition to the URL request.

Many websites present a query form to users for further access to their hidden database. A general web crawler cannot access these hidden databases due to its inability to process such query forms.

Many websites are equipped with client-side scripting languages and session-maintenance mechanisms. A general web crawler cannot access these types of pages due to its inability to interact with such pages [12].

To handle these types of pages we implemented authenticated crawling, which monitors sessions and crawls authenticated content.

As web crawlers become increasingly popular, efficient searching of web pages, for example page ranking to compare the most searched pages and websites, becomes essential for fast and accurate data finding. The performance of these operations directly affects the usability of the benefits offered by smart web crawlers.

FUTURE SCOPE

Some of the issues that can be further explored or extended are as follows:

Implementation of a distributed deep web crawler
As the size of the deep web is very large and continuously growing, deep web crawling can be made more effective by employing distributed crawling.

Updating of deep web pages

The deep web information is frequently updated. Therefore the deep web crawler should be made more effective by compensating for this through incremental deep web crawling.

Indexing of deep web pages
Search engines maintain large-scale inverted indices. Further work can be done in the direction of efficient maintenance of deep web indices.

Extension of the coverage of crawlable forms
As the deep web exists for a variety of domains, the range of crawlable forms can be extended further to make deep web crawling more efficient.

Improvement of content processing and storage mechanisms
As the deep web information is very large, further improvement can be made in deep web crawling by employing improved deep web data processing and storage mechanisms.

Chapter 13

REFERENCES
[1] C. Olston and M. Najork, Web Crawling, Foundations and Trends in Information Retrieval, vol. 4, no. 3, pp. 175–246, 2010.

[2] M. Burner, Crawling towards Eternity: Building an Archive of the World Wide Web, Web Tech-
niques Magazine, vol. 2, pp. 37-40, 1997.

[3] Allan Heydon and Marc Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219–229, April 1999.

[4] Jenny Edwards, Kevin S. McCurley, and John A. Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the Tenth Conference on World Wide Web, pages 106–113, Hong Kong, May 2001. Elsevier Science.

[5] Luciano Barbosa and Juliana Freire. Searching for hidden-web databases. In WebDB, pages 1–6, 2005.

[6] Luciano Barbosa and Juliana Freire. An adaptive crawler for locating hidden-web entry points. In Proceedings of the 16th International Conference on World Wide Web, pages 441–450. ACM, 2007.

[7] Soumen Chakrabarti, Martin Van den Berg, and Byron Dom. Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11):1623–1640, 1999.

[8] Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang. Toward large scale integration: Building a metaquerier over databases on the web. In CIDR, pages 44–55, 2005.

[9] Denis Shestakov. Databases on the web: national web domain survey. In Proceedings of the 15th Symposium on International Database Engineering and Applications, pages 179–184. ACM, 2011.

[10] Denis Shestakov and Tapio Salakoski. Host-IP clustering technique for deep web characterization. In Proceedings of the 12th International Asia-Pacific Web Conference (APWEB), pages 378–380. IEEE, 2010.

[11] Denis Shestakov and Tapio Salakoski. On estimating the scale of national deep web. In Database and Expert Systems Applications, pages 780–789. Springer, 2007.

[12] Michael K. Bergman. White paper: The deep web: Surfacing hidden value. Journal of electronic
publishing, 7(1), 2001.

[13] Denis Shestakov. On building a search interface discovery system. In Proceedings of the 2nd International Conference on Resource Discovery, pages 81–93, Lyon, France, 2010. Springer.

[14] BrightPlanet's searchable database directory. http://www.completeplanet.com/, 2013.

[15] Y. Wang, T. Peng, and W. Zhu, Schema Extraction of Deep Web Query Interface, IEEE International Conference on Web Information Systems and Mining (WISM), 2009.

ANNEXURE A

Laboratory assignments on Project Analysis of Algorithmic Design.


Mathematical module using set theory
A] Identify the set of features

S = {f1, f2, f3, ...}

where S is the main feature set, which includes the features f1, f2, f3, ...

B] Identify the set of form data called as documents

DC = {d1, d2, d3, ...}, where DC is the main set of documents, which includes the form data d1, d2, d3, ...

C] Identify the set of root words: I = {i1, i2, i3, ...}, where I is the main set of root words i1, i2, i3, ...

D] Identify the set of stop words: O = {o1, o2, o3, ...}, where O is the main set of stop words o1, o2, o3, ...

E] Identify the set of class detections: CD = {cd1, cd2, cd3, ...}, where CD is the main set of class detections cd1, cd2, cd3, ..., i.e. positive and negative words.

F] Identify the processes as P.

P = Set of processes

P = {P1, P2, P3, P4, ...}


J] Identify failure cases as FL

Failure occurs when

FL = {F1, F2, F3, ...}

1. F1: failure occurs if the data is not distributed in chunks.

K] Identify success case SS:-

Success is defined as -

SS= { S1,S2,S3,S4 }


1. S1=Data collection module

2. S2=data analysis using different algorithm

3. S3=data classification

4. S4=analysis and results

I] Initial conditions as I0

1. User wants to provide the online dataset.

2. User selects the features.

NP Hard

When solving problems we have to decide the difficulty level of our problem. There are three types
of classes provided for that. These are as follows:

1) P Class

2) NP-hard Class

3) NP-Complete Class

P
Informally the class P is the class of decision problems solvable by some algorithm within a number of
steps bounded by some fixed polynomial in the length of the input. Turing was not concerned with the ef-
ficiency of his machines, but rather his concern was whether they can simulate arbitrary algorithms given
sufficient time. However it turns out Turing machines can generally simulate more efficient computer
models (for example machines equipped with many tapes or an unbounded random access memory) by
at most squaring or cubing the computation time. Thus P is a robust class and has equivalent definitions
over a large class of computer models. Here we follow standard practice and define the class P in terms
of Turing machines.

NP-hard
A problem is NP-hard if solving it in polynomial time would make it possible to solve all problems in
class NP in polynomial time. Some NP-hard problems are also in NP (these are called "NP-complete"),

some are not. If you could reduce any NP problem to an NP-hard problem and then solve that problem in polynomial time, you could solve all NP problems. Also, there are decision problems that are NP-hard but not NP-complete, such as the infamous halting problem.

NP-complete
A decision problem L is NP-complete if it is in the set of NP problems so that any given solution to
the decision problem can be verified in polynomial time, and also in the set of NP-hard problems so that
any NP problem can be converted into L by a transformation of the inputs in polynomial time.

The complexity class NP-complete is the set of problems that are the hardest problems in NP, in the
sense that they are the ones most likely not to be in P. If you can find a way to solve an NP-complete
problem quickly, then you can use that algorithm to solve all NP problems quickly.

ANNEXURE B

Laboratory assignments on Project Quality and Reliability Testing of Project Design

Assignment 1
System Implementation Plan and Cost Estimation
Risk Management
Project plan and management activities deal with the planning of the project. The tasks are divided amongst the project members and have to be executed or completed in parallel to reduce the time required. The project plan also gives an overview of what should be done at a particular time.

Schedule Risk

The project schedule slips when project tasks and schedule release risks are not addressed properly. Schedule risks mainly affect the project and may lead to project failure. Schedules often slip for the following reasons:

• Wrong time estimation

• Resources are not tracked properly.

• Failure to identify complex functionalities and time required to develop those functionalities.

• Unexpected project scope expansions

Budget Risk:

• Wrong budget estimation.

• Cost overruns

• Project scope expansion

Operational Risk:

Causes of Operational risks:

• Failure to address priority conflicts


• Failure to resolve the responsibilities

• Insufficient resources

• No proper subject training

• No resource planning

• No communication in team.

Technical Risk:

Technical risks generally lead to failure of functionality and performance. Causes of technical risks
are:

• Continuous changing requirements

• Product is complex to implement.

• Difficult project modules integration.

Programmatic Risk

These are external risks beyond the operational limits; they are uncertain risks outside the control of the program. Such external events can be:

• Running out of funds.

• Market development

• Government rule changes.

Project Risks

Initially applications have similar threats, vulnerabilities and risks to those posed by typical web
and client/server applications. That said, because users have the power and ability to download
whatever they wish and manage their devices to their liking, we need to think about these top five
risks and how to mitigate them:

1. Inherent, Blind Trust

App stores come pre-installed on our mobile devices and provide access to a ton of mobile appli-
cations. We blindly trust that the app stores have performed due diligence on the apps in their
stores. Yet, in reality, app store vendors lack the cycles to ensure that the apps they make available
won’t open up our employees/users to risks that can harm the business.

2. Functional Risks

Opening, editing, sending, receiving and e-mailing documents; syncing backups; checking in to my
current location; etc. - these are a tiny subset of tasks that I can complete with my devices. But
what happens if I open a PDF from my business e-mail into a PDF viewer that I downloaded?
Suppose I then sync that document to the PDF viewer? At this point, my potentially sensitive
document is being managed by someone else’s application (probably insecure application and sync
storage), and it is completely outside of my control. How about if I check in to my current location
via Facebook or Foursquare? Due to the sensitive nature of what I do, some of my clients don't want others to know I am working for them. But if I "check in," the whole world (literally) becomes aware of where I am.

3. Malware

Malware has forever been a problem in the IT world, and it is no different in the mobile sphere.
Malware can wreak havoc by stealing sensitive data, monitoring traffic, connecting to internal
networks and infecting internal machines. And that’s just for starters. Malware will continue to
evolve in apps from app stores, and attackers will continue to refine their approaches to malfeasance.

4. Root Applications

Rooting and jail breaking are commonplace. Users or attackers run exploits against the mobile
operating system to provide them with unfettered access to the file system and allow them to be
the "root" user of the operating system. Some users appreciate the freedoms that having root
access gives them. Root access also provides a gateway to other app stores, such as Cydia, or the
ability to download applications from untrusted sources. The applications running as root deliver
functional and malware risks to the business. In some cases, the functional/malware line starts to
get fuzzy with the root applications because, typically, the applications provide more functionality
than the typical non-root applications provide.

5. Inappropriate Applications

Clearly, not all applications are appropriate in the workplace, and I’ll leave it to your imagination
to classify which ones would be classified as Not Safe For Work. The number of mobile applications
has gone from zero to 1.5 million in a little more than four years, and it will continue to grow in
quantum leaps. As the mobile app world continues to evolve, so will the risks. In next month’s
posting, I will discuss how to address each of these risks and provide specifics on how to thwart them

Project Schedule

The project schedule starts in August 2015 and ends in March 2016.

The Different Phases identified are:

1. Requirement Analysis.

2. Requirement Specification.

3. System Design.

4. Detailed Design.

5. Coding.

6. Testing.

The Gantt chart as shown below represents the approximate schedule followed for the completion
of each phase:

Figure 13.1: Plan of Project Execution

Financial Budget: Function Point Analysis

Function Point Analysis evaluates a system's capabilities from a user's point of view. To achieve this goal, the analysis is based upon the various ways users interact with computerized systems. From a user's perspective, a system assists them in doing their job by providing five basic functions.

Reconciled Estimation
The completed project may differ significantly from the planned tasks and the projected conditions upon which the initial estimate was based. The initial estimate must be adjusted to account for these differences if a meaningful comparison between the estimate and the actual project effort is to be established. The purpose of the Reconciliation Advisor is to recalculate the estimated effort using the actual statistics and results from the completed project. The Reconciliation Advisor gathers actual project data through a question and answer process similar to that used when the system requirements were gathered for the initial estimate. The questions differ only in that the past tense is used: "Did the application replace a mission critical or line of business process?" instead of "Does the application replace a mission critical or line of business process?".

Estimation Units - Project Effort
The total estimate is calculated in person months, which can easily be converted to other units of effort using the following conversion factors:

1 Person Days/Person Month


Effort levels are based on a relative distribution factor; they determine how much of the task group estimate will be distributed to each individual task. The mathematical formula to calculate the hours estimate for each task in a task group, based on the effort level of each task, is:

Task Hours Estimate = Employee Hours Estimate * (Task Effort Level / Employee Effort)
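
A small Java sketch of this formula is given below; the numbers used in the example are illustrative only.

// Computes a single task's hours estimate from the formula above.
public class TaskEstimate {

    static double taskHours(double employeeHoursEstimate, double taskEffortLevel, double employeeEffort) {
        return employeeHoursEstimate * (taskEffortLevel / employeeEffort);
    }

    public static void main(String[] args) {
        // e.g. 160 estimated hours, a task with effort level 2 out of a total effort of 10
        System.out.println(taskHours(160, 2, 10));  // 32.0
    }
}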
Non-functional Requirements:
This one shouldn't come as a surprise. Quality software has to be fast, or at least feel fast. As a front-end developer, this is the one I always feel first. "It's not fast enough" is a battle I never want to get into; I had this reported as a bug against one of my projects, but the client wouldn't specify what "fast enough" was. When you think about an app being performant, think about specifying the following:

1. Response times
How long should your app take to load? What about screen refresh times or choreography?

2. Processing times
Can I get a spinning beach ball, please? How long is acceptable to perform key functions or export / import data?

3. Query and reporting times
This could be covered by general reporting times, but if you're providing an API you should probably consider acceptable query times too.

4. Throughput
Think about how many transactions your system needs to handle. A thousand a day? A million?

When Amazon solved this for their needs, they decoupled systems and created a queue service that became the foundation of AWS.

5. Storage
How much data are you going to need to store to do the awesomeness you need it to do?

6. Growth requirements
This is a tough one, because you truly don't know how popular your app is going to be until it's out there. But you can bet (or hope) that someone has made predictions about how wildly successful your app is going to be. Be wary of over-engineering here, but at least make sure you aren't constantly laying down track in front of a moving train.

7. Hours of operation
When does your app need to be available? If you need to do a database upgrade or a system backup, can you take the system offline while you do it?

8. Locations of operation
A few things to think about here: geographic location, connection requirements and the restrictions of the local network. If you are building a killer app for use behind the corporate firewall, you'd better make sure you aren't using any exotic ports.
Development Plan

PROJECT SCHEDULE
The project schedule starts in June 2015 and ends in March 2016.

The Different Phases identified are:

1. Requirement Analysis.

2. Requirement Specification.

3. System Design.

4. Detailed Design.

5. Coding.

6. Testing.

Assignment 2

Mathematical module using set theory

A] Identify the set of features

S = {f1, f2, f3, ...}

where S is the main feature set, which includes the features f1, f2, f3, ...

B] Identify the set of form data called as documents

DC = { d1, d2, d3...}

where DC is the main set of documents, which includes the form data d1, d2, d3, ...

C] Identify the set of root words

I = {i1, i2, i3, ...}, where I is the main set of root words i1, i2, i3, ...

D] Identify the set of stop words.

O = {o1, o2, o3, ...}

where O is the main set of stop words o1, o2, o3, ...

E] Identify the set of class detection.

CD = {cd1, cd2, cd3...}

where CD is the main set of class detections cd1, cd2, cd3, ..., i.e. positive and negative words.

F] Identify the processes as P.

P = Set of processes

P = {P1, P2, P3, P4, ...}


J] Identify failure cases as FL

Failure occurs when

FL = {F1, F2, F3, ...}

1. F1: failure occurs if the data is not distributed in chunks.

K] Identify success case SS:-

Success is defined as -

SS= { S1,S2,S3,S4 }

1. S1=Data collection module

2. S2=data analysis using different algorithm

3. S3=data classification

4. S4=analysis and results

I] Initial conditions as I0

1. User wants to provide the online dataset.

2. User selects the features.

NP Hard

When solving problems we have to decide the difficulty level of our problem. There are three types
of classes provided for that. These are as follows:

1) P Class

2) NP-hard Class

3) NP-Complete Class

P
Informally the class P is the class of decision problems solvable by some algorithm within a number of
steps bounded by some fixed polynomial in the length of the input. Turing was not concerned with the ef-
ficiency of his machines, but rather his concern was whether they can simulate arbitrary algorithms given
sufficient time. However it turns out Turing machines can generally simulate more efficient computer
models (for example machines equipped with many tapes or an unbounded random access memory) by
at most squaring or cubing the computation time. Thus P is a robust class and has equivalent definitions
over a large class of computer models. Here we follow standard practice and define the class P in terms
of Turing machines.

NP-hard
A problem is NP-hard if solving it in polynomial time would make it possible to solve all problems in
class NP in polynomial time. Some NP-hard problems are also in NP (these are called "NP-complete"),
some are not. If you could reduce any NP problem to an NP-hard problem and then solve that problem in polynomial time, you could solve all NP problems. Also, there are decision problems that are NP-hard but not NP-complete, such as the infamous halting problem.

NP-complete
A decision problem L is NP-complete if it is in the set of NP problems so that any given solution to
the decision problem can be verified in polynomial time, and also in the set of NP-hard problems so that
any NP problem can be converted into L by a transformation of the inputs in polynomial time.

The complexity class NP-complete is the set of problems that are the hardest problems in NP, in the
sense that they are the ones most likely not to be in P. If you can find a way to solve an NP-complete
problem quickly, then you can use that algorithm to solve all NP problems quickly.
Assignment 3
Assumptions and Dependencies
We assume that there are several servers with clients attached to them. The user's system supports TCP/IP protocols.

The key considerations of Java are:

5. Object-Oriented: Java follows the "everything is an object" paradigm. The Java object model is simple and easy to extend.

6. Multithreaded: Java supports multithreaded programming which allows you to write programs

that do many things simultaneously.

7. Architecture-Neutral: the main problem facing programmers is that there is no guarantee that a program written today will run tomorrow, even on the same machine. The Java language and the JVM solve this problem; their goal is "write once, run anywhere, any time, forever".

8. Distributed: Java is designed for the distributed environment of the Internet because it handles
TCP/IP protocols.

The minimum requirements the client should have to establish a connection to a server are as follows:

Processor: Pentium III

RAM: 128 MB

Hard Disk: 2 GB

Web server: Java web server (Apache Tomcat)

Protocols: TCP/IP

External Interface Requirements :-

User Interfaces

This includes GUI standards, error messages for invalid inputs by users, standard buttons and func-
tions that will appear on the screen.

Hardware Interfaces
We use the TCP/IP protocol for establishing connections and transmitting data over the network. We use Ethernet for the LAN.
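
A minimal Java sketch of such a TCP/IP client connection is shown below; the host name and port are placeholders, and since the project's web server is Apache Tomcat the request sent is a bare HTTP/1.0 line rather than a project-specific protocol.

// Opens a TCP socket to the server, sends a minimal HTTP request and prints the status line.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;

public class TcpClientSketch {

    public static void main(String[] args) throws Exception {
        // Placeholder host and port; Tomcat listens on 8080 by default.
        try (Socket socket = new Socket("localhost", 8080);
             PrintWriter out = new PrintWriter(new OutputStreamWriter(socket.getOutputStream()), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            out.print("GET / HTTP/1.0\r\n\r\n");   // minimal HTTP request over the raw TCP socket
            out.flush();
            System.out.println(in.readLine());      // status line of the server's response
        }
    }
}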

Software Interfaces
We use Oracle for storing the database of clients who connect to the server through JDBC and ODBC.

Security Requirements

We provide authentication and authorization by passwords for each level of access. We implement the IDEA algorithm for secure data transmission.

Software Quality Attributes

The product is portable; it can run between just two connected systems or across a large network of computers. The product is maintainable, i.e. in the future the properties of the product can be changed to meet new requirements.

1.6. Apportioning of requirements

• Customer experience strategy: Leverages key insights from digital agency IBM Interactive to help
provide an enhanced multichannel customer experience, user experience design and full life-cycle
development

• Existing web experience enhancement: Approaches to better leverage content management, portals,
product catalogs and user experience

• Smarter sales and marketing: Techniques from the WebSphere Commerce development lab and
service support to help provide deep integration skills and faster-to-market deployment

• Time to value: Services leveraging prebuilt assets, global talent pools and accurate estimating tools
and techniques to help you get to market faster

Assignment 4
Interface Requirements
2.1.1 User interfaces
There will be efficient user interfaces, with proper provision for the user to input data. The user can view the result of the classification, the error rates, and the labeled accuracy.
2.1.2 Hardware interfaces
Processor: Pentium 4 or higher

RAM: 1 GB or more

Hard Disk: 40 GB or more

2.1.3 Software interfaces


Dataset: MySQL 5.1

Operating System: Windows XP SP2 or higher

Other software: relational database, JDK 1.6 or higher, Eclipse, and the Apache Tomcat web (Java) server.

2.1.4 Communication interfaces


We use the TCP/IP protocol for establishing connections and transmitting data over the network. We use Ethernet for the LAN.

ANNEXURE C

Project planner
A project planner and project management tools were used.

ANNEXURE D

Reviewers Comments of Paper Submitted


1. Paper Title: Smart Crawler: A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces

2. Name of the Conference/Journal where the paper was submitted: International Journal of Innovative Research in Computer and Communication Engineering (An ISO 3297:2007 Certified Organization), Vol. 4, Issue 1, January 2016.

3. Paper accepted/rejected : Paper accepted

4. Review comments by reviewer : Good

5. Corrective actions if any : Not recommended


ANNEXURE E

PLAGIARISM REPORT

ANNEXURE F

TERM-II PROJECT LABORATORY ASSIGNMENTS


1. Review of design and necessary corrective actions, taking into consideration the feedback report of the Term I assessment, and other competitions/conferences participated in, such as IIT, Central Universities, University Conferences or equivalent centres of excellence, etc.

2. Project workstation selection, installations along with setup and installation report preparations.

3. Programming of the project functions, interfaces and GUI (if any) as per the 1st Term term-work submission, using the corrective actions recommended in the Term-I assessment of the term-work.

4. Test tool selection and testing of various test cases for the project, and generation of various testing result charts, graphs, etc., including reliability testing.

ANNEXURE G

Information of Project Group Members

1. Name: Satyawan Umakant Dongare

2. Date of Birth: 28/06/1993

3. Gender: Male

4. Permanent Address: Balewadi

5. E-Mail: satyawandongare.sd@gmail.com

6. Mobile/Contact No.: 9923580902

7. Placement Details:

8. Paper Published: Name of the Conference/Journal where the paper was submitted: International Journal of Innovative Research in Computer and Communication Engineering (An ISO 3297:2007 Certified Organization), Vol. 4, Issue 1, January 2016.


9. Paper publication certificate

1. Name: Komal Anil Gawali

2. Date of Birth: 29/10/1994

3. Gender: Female

4. Permanent Address: Balewadi

5. E-Mail: komalgawali029@gmail.com

6. Mobile/Contact No.: 9766585213

7. Placement Details:

8. Paper Published: Name of the Conference/Journal where the paper was submitted: International Journal of Innovative Research in Computer and Communication Engineering (An ISO 3297:2007 Certified Organization), Vol. 4, Issue 1, January 2016.

1. Name: Minal Dilip Pathak

2. Date of Birth: 07/03/1993

3. Gender: Female

4. Permanent Address: Balewadi

5. E-Mail: minalpathak78@gmail.com

6. Mobile/Contact No.: 9236906760

7. Placement Details:

8. Paper Published: Name of the Conference/Journal where the paper was submitted: International Journal of Innovative Research in Computer and Communication Engineering (An ISO 3297:2007 Certified Organization), Vol. 4, Issue 1, January 2016.
