Volume 2, Issue 9, September - 2015. ISSN 2348 4853, Impact Factor 1.317
INTRODUCTION
A web search engine is software designed to search for information on the World Wide Web
[1]. The search results are generally presented in a line of results, often referred to as search engine
results pages (SERPs). The information may be a mix of web pages, images, and other types of files.
A search engine operates in the following order [4]:
1. Web crawling.
2. Indexing.
3. Searching.
Web crawling can be considered as processing items in a queue: when the crawler visits a web page, it
mines links to other web pages, puts these URLs at the back end of the queue, and continues crawling
to a URL that it removes from the front end of the queue [5].
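The queue-based crawling described above can be sketched as follows; this is a minimal illustration, and `fetch_links` is a hypothetical stand-in for downloading a page and extracting its URLs:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: remove a URL from the front of the queue,
    mine its links, and put unseen ones at the back of the queue."""
    queue = deque(seed_urls)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()          # remove from the front end
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):  # mine links to other pages
            if link not in visited:
                queue.append(link)     # put at the back end
    return visited

# A tiny in-memory "web" standing in for real pages.
fake_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = crawl(["a"], lambda url: fake_web.get(url, []))
```

Starting from page `a`, the crawler reaches every linked page exactly once, even though the link graph contains a cycle.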
Web search engines work by storing information about many web pages, which they retrieve from the
HTML markup of the pages. These pages are retrieved by a Web crawler (sometimes also known as a
spider), an automated program which follows every link on the site [6].
When a user enters a query into a search engine (typically by using keywords), the engine examines its
index and provides a listing of the best-matching web pages.
www.ijafrc.org
IV. ALGORITHM
1. Add sources - We connect our application server to heterogeneous database sources based on
their IPs. We maintain metadata containing details of these sources, such as authentication
information and the type of each data source, which helps in establishing reliable
connections with these database servers. These servers form our registered servers. This is our
arena from which we have to search for the required keyword.
Steps: Start
1-a) Connect the application server to the heterogeneous database sources using their IPs.
1-b) Maintain metadata for each source (authentication information, type of the data source, etc.).
1-c) Establish reliable connections with these servers; they form the registered servers.
End
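Step 1 can be sketched as a small metadata registry; the field names `auth` and `kind` and the example IPs are illustrative assumptions, not the paper's actual schema:

```python
# Registry of heterogeneous data sources, keyed by server IP.
registered_servers = {}

def add_source(ip, auth, kind):
    """Record the metadata needed to establish a reliable connection
    with a database server and register it as a search target."""
    registered_servers[ip] = {"auth": auth, "kind": kind}

add_source("10.0.0.5", ("admin", "secret"), "mysql")
add_source("10.0.0.7", ("root", "pw"), "mongodb")
```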
2. Add and identify a local repository which contains various directories and files that form a part
of the search domain of this application.
3. Generate Queries - As per the search keyword provided by the user, multiple heterogeneous
queries are generated for all the registered sources. These queries are generated locally on the
application server and mapped to all the connected servers (local or remote). Also, the files of the
repository (of step 2) are traversed in a breadth-first manner and each file is searched for the
keyword.
Steps: Start
3-a) Accept the search keyword from the user.
3-b) Generate a query for each registered source according to its type.
3-c) Map the generated queries to all the connected servers (local or remote).
3-d) Traverse the files of the local repository in a breadth-first manner.
3-e) Search each file for the keyword.
3-f) Collect the files in which the keyword is found.
End
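The two halves of step 3 can be sketched as below; the per-source query templates and the in-memory directory layout are illustrative assumptions standing in for real source dialects and a real file system:

```python
from collections import deque

# Hypothetical query templates, one per source type.
TEMPLATES = {
    "mysql":   "SELECT * FROM docs WHERE body LIKE '%{kw}%'",
    "mongodb": '{{"body": {{"$regex": "{kw}"}}}}',
}

def generate_queries(keyword, sources):
    """Generate one heterogeneous query per registered source."""
    return {ip: TEMPLATES[kind].format(kw=keyword)
            for ip, kind in sources.items()}

def bfs_search(tree, root, keyword):
    """Traverse a directory tree breadth-first and return the files
    containing the keyword. `tree` maps a directory to its children
    and a file to its contents."""
    hits, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        child = tree[node]
        if isinstance(child, str):      # a file: search its contents
            if keyword in child:
                hits.append(node)
        else:                           # a directory: enqueue children
            queue.extend(child)
    return hits

queries = generate_queries("wiggler", {"10.0.0.5": "mysql", "10.0.0.7": "mongodb"})
repo = {"/": ["/a.txt", "/sub"], "/a.txt": "the wiggler crawls",
        "/sub": ["/sub/b.txt"], "/sub/b.txt": "nothing here"}
found = bfs_search(repo, "/", "wiggler")
```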
4. Execute Queries - The queries that are generated in step 3 are executed by the local query
processors of all the connected data sources. The results from all servers are then sent back to the
application server.
Steps: Start
4-a) The corresponding generated queries are sent to the local query processors of all the registered servers.
4-b) Execute the queries (at the local query processors).
4-c) Send the results back to the application server.
End
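Step 4 can be sketched as a dispatch loop on the application server; the `processors` callables are stub assumptions emulating each server's local query processor:

```python
def execute_queries(queries, processors):
    """Send each generated query to its server's local query processor
    and collect the result sets back on the application server."""
    results = {}
    for ip, query in queries.items():
        results[ip] = processors[ip](query)  # executed remotely in reality
    return results

# Stub processors emulating two registered servers.
processors = {
    "10.0.0.5": lambda q: ["row1", "row2"],
    "10.0.0.7": lambda q: ["doc1"],
}
results = execute_queries({"10.0.0.5": "Q1", "10.0.0.7": "Q2"}, processors)
```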
5. Save result - The results from all the sources are maintained in a solution set called the knowledge
base. We maintain tables which contain records of all the successfully searched keywords,
including their locations. We arrange the results on the basis of their frequency of being searched
and viewed. In this process, queries are generated only once for each keyword, because the next
time it is searched, the knowledge base is used to provide the result.
Steps: Start
5-a) If the search from any source is successful, save all the result sets in the knowledge base.
5-b) Also, save the successfully searched keyword.
5-c) Present the results to the user.
End
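The knowledge base of step 5 can be sketched as a keyword-to-locations table with a view counter for frequency-based arrangement; this structure is an assumption, not the paper's actual tables:

```python
knowledge_base = {}   # keyword -> {"locations": [...], "hits": view count}

def save_result(keyword, result_sets):
    """Save the successful result sets so that later searches for the
    same keyword are served without regenerating queries."""
    locations = [loc for rs in result_sets for loc in rs]
    if locations:
        knowledge_base[keyword] = {"locations": locations, "hits": 0}

def lookup(keyword):
    """Serve a repeat search from the knowledge base, counting the view
    so results can be arranged by search/view frequency."""
    entry = knowledge_base.get(keyword)
    if entry:
        entry["hits"] += 1
        return entry["locations"]
    return None

save_result("wiggler", [["srv1:/a.txt"], ["srv2:db.docs"]])
first = lookup("wiggler")
```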
6. Update Knowledge Base - If a user is not satisfied with the result set from the knowledge base,
then the process of query generation is started again for that keyword and the knowledge base is
updated with the new result set.
Steps: Start
6-a) If the user is not satisfied with the result set from the knowledge base, restart query generation for that keyword.
6-b) Execute the regenerated queries on all the registered sources and the repository.
6-c) Update the knowledge base with the new result set.
End
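Step 6 can be sketched as a refresh path that overwrites the cached entry when the user rejects it; `regenerate` is a hypothetical stand-in for re-running steps 3-5 for the keyword:

```python
# A knowledge base holding a stale entry the user has rejected.
knowledge_base = {"wiggler": {"locations": ["stale:/old.txt"], "hits": 3}}

def update_knowledge_base(keyword, regenerate):
    """On user dissatisfaction, rerun query generation for the keyword
    and overwrite the stored result set with the fresh one."""
    fresh = regenerate(keyword)
    knowledge_base[keyword] = {"locations": fresh, "hits": 0}
    return fresh

fresh = update_knowledge_base("wiggler", lambda kw: ["srv1:/new.txt"])
```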
V. CONCLUSION
The wiggler crawls through the various servers and an identified repository and returns a set of results in
the knowledge base. Wiggler is an attempt to retrieve user desired information without physical
integration, algorithm has been proposed for searching requisite information and saving the same in
knowledge base. The solution has been tested in university, however researcher/engineers are
encouraged to test the solution on large data sets and accordingly algorithm(if required) can be enhance
for precision and performance. This proposition can be extended to crawl through web portals in
addition to the database servers and the repository.
VI. REFERENCES
[1]
https://en.wikipedia.org/wiki/Web_search_engine.
[2]
http://www.techclinch.com/search-engine/
[3]
www.ijafrc.org
[4] Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Science Department, Stanford University, Stanford, CA 94305.
[5] David Hawking, "Web Search Engines: Part 2," CSIRO ICT Centre, http://web.mst.edu/~ercal/253/Papers/WebSearchEngines-1.pdf
[6] the Web, August 2001, http://oak.cs.ucla.edu/