
Developing Next Generation Dynamic Web Services

for Scientific Portals

By
Niraj Kumar
E-mail: nirajkumariitkgp@gmail.com
©2005 Niraj Kumar. All rights reserved.

Objective:

The main objective of this whitepaper is to develop a framework for next
generation web services for scientific portals. The framework integrates various
domain-specific databases and web resources available on the WWW and brings these
heterogeneous sources of data onto a common platform, or into the form required by
the user. A domain-specific wrapper (CHEMWRAP) is developed for extracting
useful information from HTML web pages and the hidden Web and converting it
into a directly usable form. The next stage is to develop a fully automatic, dynamic and
generic web-based portal service which, based on a user-specific query, can
crawl the Web, gather the most relevant data from multiple sources and
convert these data to the user's requirement. Our goal is to simplify access
to chemistry data by providing a single access point to a large number of
sources.

Introduction:

The Web can be considered the world's biggest data source; its textual content alone
amounts to at least hundreds of terabytes. The growth rate of the Web is even more
dramatic, with its size doubling every two years. Moreover, the content of the Web
changes very fast: many past links and resources become dead while many newer ones
are added every day. Aside from these newly created pages, existing pages are
continuously updated. For example, in a Stanford University study of over half a
million pages over 4 months, it was found that about 23% of pages changed daily.
In the .com domain 40% of the pages changed daily, and the half-life of pages is
about 10 days (in 10 days half of the pages are gone, i.e., their URLs are no
longer valid) [Arasu et al.].

Apart from this, a tremendous amount of content on the Web is dynamic. According to
one estimate, close to 80% of the Web's content is dynamically generated, and this
number is continuously increasing. This dynamism takes a number of different forms,
such as temporal dynamism (time-sensitive dynamic content), client-based dynamism
(customized web pages) and input dynamism (pages whose content depends on the input
received from the user), all of which further complicate the integration of web resources
[Raghavan et al.]. However, little of this dynamic content is currently being crawled and
indexed, even by the most popular search engines, which usually index only static web
pages by following hyperlinks, ignoring search forms and pages that require
authorization or prior registration.

Crawling the hidden Web is a very challenging problem for two fundamental reasons.
First is the issue of scale: a recent study estimates that the size of the content available
through such searchable online databases is about 400 to 500 times larger than the size
of the "static Web". Second, access to these databases is provided only through restricted
search interfaces intended for use by humans. However, domain-specific scientific
portals need to crawl and integrate these hidden Web databases to meet the task-specific
requirements of a particular user and application.
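
As an illustration of what such a restricted interface implies for a crawler, the minimal sketch below submits a query to a hypothetical hidden-Web search form over HTTP POST using only the standard Java library; the endpoint and parameter names are placeholders, not those of any real databank.

    // Minimal sketch of querying a hidden-Web search form via HTTP POST.
    // The endpoint and parameter names below are hypothetical placeholders.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class HiddenWebQuery {
        public static void main(String[] args) throws Exception {
            String endpoint = "http://example.org/compound-search";   // placeholder
            String body = "name=" + URLEncoder.encode("benzene", "UTF-8");

            HttpURLConnection conn =
                    (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded");

            // Send the form fields exactly as a browser would.
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes("UTF-8"));
            }

            // Read back the dynamically generated result page.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }

The point of the sketch is that the result page exists only in response to the submitted form values; there is no static URL for a hyperlink-following crawler to discover.
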
Much of modern scientific and technological advancement has been driven by the latest
developments in the sciences and in Information Technology, and web-based systems in
particular are going to play a very important role. With the technical development of the
applied sciences facing many limitations and ever-increasing problem complexity, the
growing use of computers to solve problems is only natural. Today, the scientific and
research portals domain covers wide areas ranging from physical, organic and inorganic
chemistry to biochemistry, molecular modeling, biology, drug design, geosciences,
applied engineering, and many more. The scientific community is globally distributed,
with a culture of sharing and rapid dissemination of information. Each separate area of
science generates its own data and information sources.

A large amount of scientific data is distributed over the Internet (for example: the
Cambridge Crystallographic Data Centre, NIST, and the Protein Data Bank).
Typically, this information is accessible only through custom web-based query
interfaces. Each of these sources and databases has a different structure, content and
query language and returns data in a different format. Furthermore, they are prone to
having their interfaces and formats updated without warning [Buttler et al.]. To assist
scientists, students and industrial users, a large number of specialist interrogation,
modeling and analysis software tools are available (for example: GAMESS, Gaussian,
GAP, Ghemical, COLUMBUS, COSMO-RS, etc.).

When scientific resource users require information from multiple sources, they must
pose the appropriate query at each source individually and then explicitly integrate the
results. This solution may be acceptable for a small number of sources, but it quickly
becomes an overwhelming burden for users as the number of sources grows
[Buttler et al.]. Currently there are thousands of scientific resources and databases,
making it infeasible to manually gather the required and relevant data from these
sources. Our proposed system aims to provide a user interface where users can enter
their query; it should then be able to perform the following query formulation and
execution tasks:

(a) identify sources and their locations, both static and dynamic
(b) identify the content/function of sources and their type
(c) cluster Web pages based on their structure and attributes
(d) apply a generic wrapper, CHEMWRAP, able to filter the required information from HTML pages as well as hidden databases across heterogeneous sources
(e) transform data into the user's required format
(f) merge results from different sources
(g) optimize the whole system to give the most efficient, secure and low-cost solution
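
One possible way to organize these tasks, sketched here purely for illustration, is as a set of small Java interfaces for the portal pipeline; all type and method names below are assumptions of this sketch, not an existing API.

    // Hypothetical organization of tasks (a)-(g) as portal components.
    // All type and method names are illustrative, not an existing API.
    import java.util.List;
    import java.util.Map;

    class SourceDescription {
        String url;
        boolean hidden;      // true for hidden-Web (form-based) sources
        String contentType;  // e.g. "crystal structures", "spectra"
    }

    interface SourceLocator {
        // (a), (b): find static and hidden-Web sources relevant to a query
        List<SourceDescription> locate(String userQuery);
    }

    interface PageClusterer {
        // (c): group retrieved pages that share the same structure
        Map<String, List<String>> clusterByStructure(List<String> pageUrls);
    }

    interface ChemWrap {
        // (d), (e): filter the required fields and emit them in the target format
        String extract(String sourcePage, String targetFormat);
    }

    interface ResultIntegrator {
        // (f), (g): merge per-source results into a single answer for the user
        String merge(List<String> perSourceResults);
    }

A concrete portal would plug source-specific implementations of these interfaces behind the single user-facing query form.
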
Challenges in developing future-generation Web services can be broadly classified as an
integration of a number of interrelated problems: developing a system that can, in real
time, identify the most relevant static and dynamic sources (essentially the problem of
building an advanced search engine and crawling technology with high precision and
recall); addressing the heterogeneity of these resources, i.e., developing a multi-database
system (a problem of portability and platform independence, from the data sources to the
hardware and software used); extracting only the relevant information from these sources
(i.e., developing the wrapper/filter methodology and technology, which includes the
larger problem of Web semantics); and finally developing customized user interfaces and
integrating all software, networking and hardware sub-systems. Over the last 30 years a
number of efforts have been made in all of these areas separately, and in the last 10 years
more focused attempts to integrate these techniques and methodologies have been made.
However, my literature review revealed that every present-day system falls far below the
level of the challenges posed by the requirements of such a system. Before going into the
methodology of our approach, I would like to give a brief overview of the attempts
already made in this direction, particularly with reference to providing future-generation
Web services from heterogeneous sources.

Overview of Related work

Many researchers have tackled problems related to information extraction and integration
from the Web. These efforts range from toolkits that aid in building wrappers manually,
and wrapper induction, to the extraction of relational data from large collections of web
documents and the extraction of symbolic knowledge. Some wrapper construction
methods are manual, while others are semi-automatic or automatic. However, manual
coding of wrappers has become entirely impractical in current scenarios. The
methodologies employed to develop wrappers vary widely, from finding patterns in
HTML pages using tree structure, to finite-state based approaches, to learning and
training approaches based on fuzzy sets, artificial intelligence and neural networks.
Some of the well-known research groups and products in these areas are: ANDES,
WysiWyg Web Wrapper Factory (W4F), Ariadne, Garlic, TSIMMIS, XWRAP, the
Mostrare Project, STALKER, TAMBIS, SoftMealy, FASTUS, HLRT wrappers, Jedi, etc.

XWRAP [Liu et al.]: XWRAP is a semi-automatic, XML-enabled wrapper construction
system for Web sources. The architecture of XWRAP consists of four components:
syntactical structure normalization, information extraction, code generation, and program
testing and packaging. XWRAP was developed in Java. "XML-enabled" means that the
wrapper programs generated by XWRAP can transform an HTML document into an
XML document and deliver the extracted data content in XML format with a DTD.

STALKER [Muslea et al.]: STALKER, developed at the University of Southern
California, is a wrapper induction algorithm that generates extraction rules for
semi-structured Web-based information sources using landmark automata. Based on just
a few training examples, STALKER learns extraction rules for documents with multiple
levels of embedding.

FASTUS [Hobbs et al.]: FASTUS is a five-stage system for extracting information
from natural-language text. It works essentially as a cascaded, non-deterministic
finite-state automaton. Decomposing the language processing enables the system to do
exactly the right amount of domain-independent syntax, so that domain-dependent
semantic and pragmatic processing can be applied to the right larger-scale structures.
Blind experiments have demonstrated that it is very efficient.

WysiWyg Web Wrapper Factory [Sahuguet et al.]: W4F, developed at the Penn Database
Research Group, is a toolkit that allows the fast generation of Web wrappers. Wrapper
generation consists of retrieval of an HTML page via the GET or POST method, followed
by construction of an HTML parse tree according to the HTML hierarchy. Information can
then be extracted declaratively using a set of rules applied to the parse tree. A nested
string list (NSL) data structure is used as the datatype to represent extracted information
internally.
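
The nested representation idea can be illustrated, only as an analogue and not as W4F's actual NSL datatype, by a recursive structure whose nodes are either string leaves or lists of further nodes.

    // Illustrative nested-string-list structure: each node is either a string
    // leaf or a list of further nodes. This is only an analogue of the idea,
    // not W4F's actual NSL implementation.
    import java.util.Arrays;
    import java.util.List;

    class Nsl {
        final String leaf;          // non-null for a string leaf
        final List<Nsl> children;   // non-null for a nested list

        Nsl(String leaf) { this.leaf = leaf; this.children = null; }
        Nsl(List<Nsl> children) { this.leaf = null; this.children = children; }

        @Override public String toString() {
            return leaf != null ? "\"" + leaf + "\"" : children.toString();
        }

        public static void main(String[] args) {
            // e.g. one record extracted from a table row: a name plus two nested values
            Nsl record = new Nsl(Arrays.asList(
                    new Nsl("benzene"),
                    new Nsl(Arrays.asList(new Nsl("C6H6"), new Nsl("78.11")))));
            System.out.println(record);
        }
    }
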
InfoSleuth [Bayardo et al.]: The InfoSleuth project at MCC exploits and synthesizes
new technologies into a unified system that retrieves and processes information in an
ever-changing network of information resources. The system is scalable and portable,
which is accomplished through the use of collaborative agents, and it uses a Java-based
common wrapper agent.

TAMBIS [Baker et al.]: The TAMBIS project at the University of Manchester, UK, is a
three-layer mediator/wrapper architecture which aims to provide transparent access to
various disparate biological databases and analysis tools. The use of a knowledge base
and wrapped resources removes the need for the user to know which resources are
appropriate and how to access them, greatly reducing the time taken to analyze their
data. TAMBIS aims to use CORBA-wrapped services.

Garlic: Developed at IBM Almaden Research Center, Garlic is a middleware system
that provides an integrated view of heterogeneous legacy data without changing how or
where data is stored. It provides a unified schema and common interface for new
applications without disturbing existing applications. This relies on wrappers that
encapsulate the underlying data and mediate between the data source and the middleware.

Other projects which specifically aim at diverse and heterogeneous databases are
SINGAPORE, TSIMMIS, DISCO, etc.

Problems related to crawling the hidden Web and developing search engines have been
addressed by Raghavan et al. and Brin et al., among others.

METHODOLOGY

The methodology to be adopted for this study is to develop a web crawler specific to
scientific areas which is able to crawl the Web for available database resources. To
start, we do not propose to crawl all Web resources but to stick to four or five sources.
Since most of the available databases are hidden and have their own data retrieval
mechanisms and user interfaces, we need to develop a crawler that takes all of these
factors into account. Based on these sources, we then cluster information according to
structural similarity. Each of these databases has its own format, but the formats are
closely related because each contains data about the molecular structure of chemical
compounds. We therefore propose a generic wrapper, CHEMWRAP, which is able to
filter the required information from HTML pages, convert it into a common format
(say XML), and extract and convert the required information into the format supported
by a given scientific software program. We will develop our whole system in Java,
XML, COM/CORBA and other Java-based Web technologies.
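
As a rough, minimal sketch of the fetch-and-convert step described above, the following program retrieves a page and wraps two extracted fields in a small XML record using only the standard Java library; the HTML markers, field names and output schema are placeholders rather than the actual CHEMWRAP rules.

    // Minimal sketch of the fetch-and-convert step, assuming a page whose
    // compound name and formula appear in simple HTML markers. The markers,
    // field names and output schema are illustrative, not the real CHEMWRAP rules.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ChemWrapSketch {

        // Fetch the raw HTML of a source page.
        static String fetch(String url) throws Exception {
            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            return html.toString();
        }

        // Extract the first match of a pattern, or an empty string.
        static String extract(String html, String regex) {
            Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
            return m.find() ? m.group(1).trim() : "";
        }

        public static void main(String[] args) throws Exception {
            String html = fetch(args[0]);
            // Placeholder extraction rules; the real wrapper works on a parse tree.
            String name = extract(html, "<h1[^>]*>([^<]+)</h1>");
            String formula = extract(html, "Formula[^<]*</td>\\s*<td[^>]*>([^<]+)</td>");
            // Emit the extracted fields as a small XML record.
            System.out.println("<compound>");
            System.out.println("  <name>" + name + "</name>");
            System.out.println("  <formula>" + formula + "</formula>");
            System.out.println("</compound>");
        }
    }

The downstream integrator would then merge such XML records from several sources and convert them into the input format of the chosen scientific software.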

Decomposition of the Web Information Extraction Task: The wrapper generation
process is so complex that the construction cannot be considered as occurring in one
single step. For this reason we have partitioned the CHEMWRAP construction process
into six phases (Figure 2), between any two of which interaction and information
exchange takes place. After the preprocessing of sources, information extraction begins.
The main task of the information extraction component is to explore and specify the
structure of the retrieved document (page object) in a declarative extraction rule
language. For an HTML document, the information extraction phase takes as input a
parse tree generated by the syntactical normalizer. It first interacts with the user to
identify the semantic tokens (groups of syntactic tokens that logically belong together)
and the important hierarchical structure. It then annotates the tree nodes with semantic
tokens in comma-delimited format and the nesting hierarchy in a context-free grammar.
More concretely, the information extraction process involves three steps; each step
generates a set of extraction rules to be used by the code generation phase to generate
the wrapper program code.

Step 1: Identifying the regions of interest on the page
Step 2: Identifying the semantic tokens of interest on the page
Step 3: Determining the nesting hierarchy for the content presentation of the page
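
To make these three steps concrete, the sketch below, which assumes the repaired source document is well-formed XHTML, walks a parse tree, treats each table as a region of interest and each cell as a semantic token, and prints the tokens under their region; the tag choices stand in for real extraction rules.

    // Illustrative sketch of the three extraction steps applied to a parse tree.
    // The rule representation and tag names are assumptions, not CHEMWRAP's own.
    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class ExtractionSteps {
        public static void main(String[] args) throws Exception {
            // Parse the (repaired, well-formed) source document into a tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new File(args[0]));

            // Step 1: region of interest -- here, every <table> element.
            NodeList regions = doc.getElementsByTagName("table");
            for (int i = 0; i < regions.getLength(); i++) {
                Element region = (Element) regions.item(i);

                // Step 2: semantic tokens of interest -- here, each <td> cell.
                NodeList tokens = region.getElementsByTagName("td");

                // Step 3: nesting hierarchy -- record tokens under their region,
                // which the code generator would turn into extraction rules.
                System.out.println("region " + i + ":");
                for (int j = 0; j < tokens.getLength(); j++) {
                    System.out.println("  token: " + tokens.item(j).getTextContent().trim());
                }
            }
        }
    }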

[Figure 1: Proposed system architecture. Clients send HTTP requests to the scientific
portal; per-source wrappers under CHEMWRAP extract data from the Cambridge
databank, the NIST databank and the Protein databank; the extracted data pass through
a data integrator into XML form and then into a scientific-software format, where
results are calculated by the scientific software.]

[Figure 2: Decomposition of the Web information extraction task (CHEMWRAP
architecture). The phases are: HTTP query building from an entered set of URL(s);
fetching and repairing the source document; clustering pages of the same structure if
needed; generating a parse tree from the required source document; information
extraction, which produces the extraction rules; XML-enabled wrapper code generation,
which produces the wrapper code; and code testing and integration with ChemCraft.]

References

Raghavan Sriram, Garcia-Molina Hector, "Crawling the Hidden Web", Computer
Science Department, Stanford University, Stanford, USA, 2000, pp. 25.
Buttler David, Critchlow Terence, "Using meta-data to automatically wrap
bioinformatics sources", Information and Software Technology, No. 44, 2002, pp. 237-239.
Baker Patricia G. et al., "TAMBIS - Transparent Access to Multiple Bioinformatics
Information Sources", School of Biological Sciences, University of Manchester, UK.
Arasu Arvind, Cho Junghoo et al., "Searching the Web", Computer Science
Department, Stanford University, 2000, pp. 42.
Habegger Benjamin, Quafafou Mohamad, "Multi-pattern wrappers for relation
extraction from the Web", IRIN, University of Nantes, France, 2003, pp. 5.
Myllymaki Jussi, "Effective Web data extraction with standard XML technologies",
Computer Networks, No. 39, 2002, pp. 635-644.
Liu Ling, Pu Calton, Han Wei, "XWRAP: An XML-enabled Wrapper Construction
System for Web Information Sources", Georgia Institute of Technology, Atlanta, pp. 11.
Muslea Ion, Minton Steve, Knoblock Craig, "STALKER: Learning extraction rules for
semistructured Web-based information sources", IMSC, University of Southern
California, USA, pp. 8.
Hobbs Jerry R., Appelt Douglas et al., "FASTUS: A Cascaded Finite-State Transducer
for Extracting Information from Natural-Language Text", Artificial Intelligence Center,
SRI International, California, 1997, pp. 25.
Sahuguet A., Azavant F., "W4F", 1998. http://db.cis.upenn.edu/W4F.
Bayardo R. J., Bohrer W. et al., "InfoSleuth: Agent-Based Semantic Integration of
Information in Open and Dynamic Environments", Microelectronics and Computer
Technology Corporation, Austin, Texas, 1997, pp. 12.
Roth Mary Tork, Schwarz Peter, "A Wrapper Architecture for Legacy Data Sources",
IBM Almaden Research Center.
Brin Sergey, Page Lawrence, "The Anatomy of a Large-Scale Hypertextual Web Search
Engine", Computer Science Department, Stanford University, pp. 26.
Brin Sergey, "Extracting Patterns and Relations from the World Wide Web", Computer
Science Department, Stanford University, pp. 12.
Liu Ling, Pu Calton, Han Wei, "An XML-enabled data extraction toolkit for web
sources", Information Systems, No. 26, 2001, pp. 563-583.
Habegger Benjamin, Quafafou Mohamad, "Web Services for Information Extraction
from the Web", Proceedings of the IEEE International Conference on Web Services,
2004, pp. 8.