
Extraction and Classification of Unstructured Data in WebPages for Structured Multimedia Database via XML

Siti Z. Z. Abidin,

Noorazida Mohd Idris and Azizul H. Husain


Faculty of Computer and Mathematical Sciences,
Universiti Teknologi MARA,
40450, Shah Alam, Selangor, Malaysia.
{sitizaleha533, noorazida}@salam.uitm.edu.my, azizulh84@yahoo.com


Abstract—Nowadays, a vast amount of information is
available on the Internet. Useful data must be captured
and stored for future purposes. One of the major
unsolved problems in the information technology (IT)
industry is the management of unstructured data.
Unstructured data such as multimedia files, documents,
spreadsheets, news, emails, memorandums, reports and
web pages are difficult to capture and store in common
database storage. The underlying reason is that the tools
and techniques that proved so successful in transforming
structured data into business intelligence and actionable
information simply do not work when it comes to
unstructured data. As a result, new approaches are
necessary. Several researchers have attempted to deal
with unstructured data, but so far it is hard to find a tool
that can store and retrieve extracted and classified
unstructured data in a structured database system.
This paper presents our research on unstructured
data identification, extraction and classification for web
pages. The classified data is transformed into a structured
format in an Extensible Markup Language (XML)
document and later stored in a multimedia database. The
contribution of this research lies in the approach to
capturing the unstructured data and in the efficiency of a
multimedia database in handling this kind of data. The
stored data can benefit various communities, such as
students, lecturers, researchers and IT managers, because
it can be used for planning, decision-making, day-to-day
operations, and other future purposes.
Keywords — Unstructured data; Webpage; Data
extraction; Data classification; XML; Multimedia database
I. INTRODUCTION
Today, people's lives are greatly influenced by
information technology due to the pervasive reach of the
Internet, the falling cost of disk storage and the
overwhelming amounts of information stored by
today's business industries. While a vast amount of
data is available, what is sorely missing are tools and
methods to manage this unstructured ocean of facts and
turn it into usable information. Unstructured data
includes documents, spreadsheets, presentations,
multimedia files, memos, news, user groups, chats,
reports, emails and web pages [1]. Merrill Lynch [2]
estimated that more than 85 percent of all business
information exists as unstructured data [3]. This data
forms a significant part of an organization's knowledge
base and needs to be properly managed for long-term
use. With more than three quarters of data unstructured,
it represents the single largest opportunity for positive
economic returns. Through effective unstructured data
management, revenue, profitability and opportunity can
go up, while risks and costs may go down [3].
This paper presents an exploratory study on various
tools for data extraction and classification, and the
design of a prototype tool that is able to extract and
classify unstructured data in any web page. The
classified data is structured in Extensible Markup
Language (XML) before all the useful data is stored
in an Oracle multimedia database. Based on the analysis
of the currently available tools, the prototype is designed
and implemented in C#, adopting the most significant
methods among those tools. In transforming the
unstructured data into structured form, image data is
converted into a specific format through double
conversion to speed up image retrieval from the
multimedia database.
This paper is organized as follows: Section II
describes related work on data extraction and
classification techniques, Section III explains the
research methodology in detail, Section IV
demonstrates the results of this research, and Section V
draws the conclusion.
II. RELATED WORK
The World Wide Web is a growing database in
which a great amount of information is available.
There are three types of web pages: unstructured,
semi-structured and structured. Unstructured web pages
are those in which the information of interest is
embedded in free text from which no common pattern
can be induced. A semi-structured web page is typically
generated from a template and a set of data, so that one
or more patterns can be inferred and used to extract
data from the page. A structured web page presents
information in HTML for human browsing, but also
offers structured data that can be processed
automatically by machines. Such data is easily
integrated into business processes.
Unfortunately, querying and accessing this data by
software agents is not a simple task, since it is
represented in a human-friendly format. Although this
format makes it easier for humans to understand and
browse the Web, it makes the incorporation and use of
the data by automated processes very difficult. Possible
solutions are the semantic web [4], which is still a
vision, or web services [5], which lack a complete
specification. An information extractor, or wrapper [6],
may fill the gap and help transform the web into
completely structured data that is usable by automated
processes.

978-1-4244-5651-2/10/$26.00 ©2010 IEEE
There are many extraction algorithms, but
unfortunately none of them can be considered a perfect
solution. They are usually designed and built to provide
distinct interfaces, which complicates the task of
integrating these algorithms into enterprise
applications. Several researchers have introduced
methods to convert web page data from semi-structured
or unstructured format into a structured,
machine-understandable design, and XML is found to
be the most popular.
A. Data Extraction
Data extraction is the process of retrieving and
capturing data from one medium into another. The
medium can be web pages, documents, databases,
repositories, stacks or anything else that contains
information. According to the evText website [7], data
retrieval is the process of locating and linking data
points in the user-supplied document with
corresponding data points in the data retrieval
structure. A wrapper accesses HTML documents and
exports the relevant text to a structured format,
normally XML [8]. In order to extract data from a
webpage, two tasks need to be considered: defining its
input and its extraction target. The input can be an
unstructured, semi-structured or structured page. The
extraction target can be a relation of k-tuples, where k
is the number of attributes in a record, or a complex
object with hierarchically organized data [6]. Moreover,
information extraction can become complicated when
various permutations of attributes or typographical
errors occur in the input documents.
There are various classifications of wrappers. For
example, Hsu and Dung [9] classified wrappers into
four distinct categories, including hand-crafted
wrappers with heuristic-based and induction
approaches. Chang [6] followed this taxonomy and
distinguished annotation-free and semi-supervised
systems. Muslea [10] focused on extraction patterns
for free text using syntactic or semantic constraints.
However, the most complete categorization was made
by Laender [11], who proposed a taxonomy of
approaches to wrapper development consisting of
HTML-aware tools, NLP-based tools, wrapper
induction tools, modeling-based tools and
ontology-based tools.
B. Data Classification
Data classification categorizes data based on
required needs. The goal of classification is to build a
set of models that can correctly predict the class of
different objects. There are many algorithms for data
classification in data mining and machine learning,
and they are also used as the base algorithms in some
data extraction systems. For example, the k-Nearest
Neighbor (KNN) algorithm [12] determines the class
of a data item through its similarity (in terms of
distance) to its neighbors. The Naïve Bayesian (NB)
algorithm [13] and the Concept Vector-based (CB)
algorithm [14] are mostly used for classifying words
in documents. Other methods for classifying data are
the Classification and Regression Trees (CART)
algorithm [15] and the PageRank algorithm [16].
CART is implemented using decision trees, while
PageRank is a search ranking algorithm based on
hyperlinks on the Web.
Classifying data into several categories is important
because the raw data has to be matched with the
corresponding data classes specified in the database.
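As an illustration of the distance-based idea behind KNN, the following is a minimal sketch in Python (the paper's prototype is in C#; the data points here are invented for illustration):

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    labelled neighbours in `train` (a list of (point, label))."""
    neighbours = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_classify(train, (0.5, 0.5)))  # a point near the first cluster -> "a"
```

The same voting scheme works for any distance function; only the `key` used for sorting needs to change.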
C. Tools for Data Extraction
Several tools are compared with respect to their page
type, class, feature extraction, extraction rule type, and
learning algorithm. Table 1 depicts the results, which
helped in the design and implementation of the prototype
tool produced by this research. For the page type, the
structure of the input documents is compared.
As proposed by Laender [11], tools for data
extraction are developed based on several approaches.
HTML-aware tools are used for HTML documents and
require the document to be represented as a parsing tree.
The tree reflects the HTML tag hierarchy and is
generated either semi-automatically or automatically.
Examples of such tools are RoadRunner [17] and Lixto [18].
RAPIER [19], SRV [20] and WHISK [21] use
natural language processing (NLP) techniques, such as
filtering, part-of-speech tagging and lexical semantic
tagging, to build relationships between sentence
elements and phrases. These techniques derive
extraction rules based on syntactic and semantic
constraints that help to identify the relevant
information within a document.
There are also several wrapper induction tools that
generate delimiter-based extraction rules derived from
a given set of training examples. The main difference
between NLP-based tools and these tools is that the
latter do not rely on linguistic constraints, but rather on
formatting features that implicitly delineate the
structure of the pieces of data found. Therefore, they
are more suitable for HTML documents than the
previous ones. An example of this type of tool is
STALKER [10]. For modeling-based tools, a target
structure is provided according to a set of modeling
primitives that conform to an underlying data model.
Examples of tools that use this approach are
NoDoSE [22] and DEByE [11].
There are also wrappers that use an ontology-based
approach to locate constants present in the page and
construct objects from them. This approach differs
from all those explained previously, which rely on the
structure or presentation features within a document to
generate extraction rules or patterns; here, extraction is
accomplished by relying directly on the data. An
example of an ontology-based tool is the one developed
by the Brigham Young University Data Extraction
Group [23].






TABLE I. COMPARATIVE STUDY ON EXTRACTION TOOLS


III. SYSTEM DESIGN
The prototype tool, built using C#, is designed and
developed based on the framework shown in Fig. 1. The
framework consists of User, Interface, Source, XML
and Multimedia Database layers. The layers
communicate with each other in order to retrieve or
pass data from one layer to another.
















Fig. 1. Research Framework

Each layer has its functionality as follows:
• Layer 1 - This layer represents the user who will be
using the implemented system.
• Layer 2 - The interface is the interaction medium
between the user and the source location; it allows
the user to manipulate data in the webpage.
Examples of interface technologies are
programming languages that support network
environments, such as Java and C#.
• Layer 3 - The source layer consists of large amounts
of useful data in web pages, in the form of
structured, semi-structured or unstructured pages.
The user identifies the useful data to be extracted
from the source, which is later stored in a storage
location that can handle various types of multimedia
elements. Before data can be placed in the storage,
it needs to be classified. Classification determines
where each piece of data should be allocated in the
storage and is based on the type of data: text,
image, audio or video.
• Layer 4 - The result of the classification process
is placed in a structured XML document. The
structured data is then transmitted to the storage
layer, located at Layer 5.
• Layer 5 - The storage used in this layer is a
multimedia database, needed to handle the huge
amount of data consisting of various multimedia
elements such as text, audio, image and video
formats. An example of a multimedia database
suitable for this purpose is Oracle 11g, which
supports any type of data, especially for business
and commercial purposes.
A. Extraction and Classification
Classification of data patterns is important,
especially for data extraction from the webpage. Fig. 2
illustrates the process of data extraction and
classification. Four classes of data have been
identified: text, image, video and audio. Each class has
several sub-classes representing detailed categories of
the data. Media data such as audio, video and images
are identified when the parser finds the keyword
“src=” in the data structure during the extraction
process; “src” is the keyword for a source reference, so
the parser knows where to locate the source data.
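The “src=” detection described above can be sketched using Python's standard html.parser as a stand-in for the C# parser (the sample page is invented):

```python
from html.parser import HTMLParser

class SrcExtractor(HTMLParser):
    """Collect the value of every src attribute, as the parser
    does when it meets "src=" during the extraction process."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "src":
                self.sources.append((tag, value))

page = '<html><body><img src="logo.png"><video src="clip.mp4"></video></body></html>'
parser = SrcExtractor()
parser.feed(page)
print(parser.sources)  # [('img', 'logo.png'), ('video', 'clip.mp4')]
```

Each collected pair carries both the tag (which hints at the media class) and the source reference the tool would follow.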


Fig. 2. Data Classes

After the location of the source is detected, the parser
identifies its data type and assigns its class. For text
or label types, no keyword is required for reference,
since the text can be identified within the HTML tag
structure. Table 2 shows the classes of data types and
their classification.

TABLE II. CLASSIFICATION OF THE FOUR MAIN DATA TYPE CLASSES

Content Type   Description
Text           Strings, numbers and symbols
Image          Various image formats
Video          Various video formats
Sound          Various sound formats
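One plausible way to realize this classification is to map a source reference's file extension to one of the four classes. The extension lists below are assumptions for illustration, since the paper does not enumerate the supported formats:

```python
import os

# Hypothetical extension lists; anything else falls back to Text.
CLASSES = {
    "Image": {".png", ".jpg", ".jpeg", ".gif", ".bmp"},
    "Video": {".mp4", ".avi", ".mpg", ".mov"},
    "Sound": {".mp3", ".wav", ".ogg"},
}

def classify(src):
    """Return the content class for a source reference; sources
    without a known media extension are treated as Text."""
    ext = os.path.splitext(src.lower())[1]
    for cls, exts in CLASSES.items():
        if ext in exts:
            return cls
    return "Text"

print(classify("banner.jpg"))   # Image
print(classify("report.html"))  # Text
```

A real implementation would likely also consult the HTTP Content-Type header, since extensions on the web are not always reliable.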

In the classification process, the Document Object
Model (DOM) tree technique is applied to find the
correct data in the HTML document. During web page
navigation, the DOM is used in data extraction because
it allows DOM events to be processed simultaneously.
An example of a DOM tree structure is shown in
Fig. 3.



Fig. 3: Example of DOM tree

In the DOM tree, unnecessary nodes such as script,
style or other customized nodes need to be filtered out.
The content of interest appears under the body node,
i.e., within the body tag. The advantage of the DOM is
that it is filled with lots of information; however, not
all of the unnecessary information can be eliminated
completely. With pattern classification, the
unnecessary information can be minimized during the
extraction process.
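The filtering of script and style nodes can be sketched as follows, using Python's xml.etree on a small well-formed page as a stand-in for the browser DOM the real tool walks:

```python
import xml.etree.ElementTree as ET

page = """<html><body>
  <script>var x = 1;</script>
  <style>p { color: red }</style>
  <p>Useful text</p>
  <img src="photo.png"/>
</body></html>"""

root = ET.fromstring(page)
body = root.find("body")
# Drop nodes that carry no extractable content.
for unwanted in body.findall("script") + body.findall("style"):
    body.remove(unwanted)
print([child.tag for child in body])  # ['p', 'img']
```

Real HTML is rarely well-formed XML, so a production tool would use a lenient HTML parser, but the pruning step is the same.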
B. Implementation
The architecture of the prototype system consists of
six important components: web, generator, user,
converter, XML document and multimedia database.
Fig. 4 depicts a high-level view of the system
architecture, showing the flow of data extraction from
a web page into the multimedia database.

Fig. 4: The Prototype Architecture

• Web - The web is a collection of information on the
World Wide Web (WWW). Websites contain much
structured, semi-structured and unstructured data
that needs to be captured for many purposes in
different areas, such as financial statements, weather
reports, stock exchange data, travel information and
advertisements.
• The Generator - This component supports the user
during the wrapper design phase. The generator is
used to request an HTTP web service from the
target web and to retrieve data from the web. It
consists of three components: the visual interface,
the program generator and temporary storage.
o Visual Interface–Defines the data that should
be extracted from web pages and maps it into
a structured format such as an HTML form.
The visual interface contains several windows:
a window displaying the HTML structure of
the web page, a parse tree window showing the
categories of data to be extracted from the
current webpage, and a table for the result of
data classification. In addition, a control
window allows the user to view and control the
overall progress of the wrapper design process
and to input textual data.
o Program Generator–A program window
displays the constructed wrapper program and
allows the user to make further adjustments or
corrections to it. This sub-module interprets
the user's actions on the web pages and
successively generates the wrapper. It specifies
the URLs of web pages that are structurally
similar to the target web pages, or navigates to
such pages. In the latter case, the navigation
path is recorded and can be automatically
reproduced.
o Temporary Storage–A temporary storage
location stores the result of data extraction
from the web. It holds the four data categories
(text, image, audio and video) in separate
locations.
• The User – This component specifies the input data
for the generator and categorizes the results of the
extraction process to be stored in the multimedia
database.
• Converter–Consists of three converters for data
conversion, either from an XML document to the
multimedia database or from the generator to an
XML document.
o Bitmap converter–Converts various image
formats into Bitmap and vice versa. This
converter is used for images only.
o Base64 converter–Converts Bitmap into
Base64 format and vice versa. This converter
is used for images only.
o String converter–Converts all format types
into string format.
• XML–A structured store for the classified data and
a medium for data transmission from the web to the
multimedia database. An XML document holds
various types of data such as text, audio, video and
image.
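The Base64 converter's role, carrying binary image data safely inside an XML element, can be sketched in Python as a stand-in for the C# converters (the byte string is a placeholder, not a real image):

```python
import base64
import xml.etree.ElementTree as ET

def image_to_base64(raw_bytes):
    """Encode raw image bytes as a Base64 text string, so the
    binary data can travel inside an XML element."""
    return base64.b64encode(raw_bytes).decode("ascii")

def base64_to_image(text):
    """Reverse the conversion when reading the XML back."""
    return base64.b64decode(text.encode("ascii"))

raw = b"\x89PNG...fake-image-bytes"      # placeholder, not a real image
doc = ET.Element("data")
img = ET.SubElement(doc, "image", name="logo.png")
img.text = image_to_base64(raw)

assert base64_to_image(img.text) == raw  # the round trip is lossless
print(ET.tostring(doc, encoding="unicode")[:40])
```

Base64 inflates the payload by roughly a third, which is the usual price of embedding binary data in a text format.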

IV. RESULTS
The user interface of the prototype tool, illustrated
in Fig. 5, allows users to store multimedia elements
from any specified web page. A progress bar at the
bottom of the screen shows users the percentage of
processing work done by the engine. Several items in
the main menu, such as Headers, URLs, URL's title,
Emails and Phone, assist in dealing with the
information of interest.



Fig. 5: Screenshot of the prototype system

Using the provided interface, a user can extract
useful multimedia data that resides in the webpage
specified in the URL column. The tool extracts useful
information by searching all the possible links
associated with the webpage. As an example, Fig. 6
shows some of the links for a given URL, illustrating
various links to other web pages as well as data
residing in the webpage.




Fig. 6: All related links associated with the given webpage

The useful multimedia data is classified using
regular expressions and the DOM tree path learning
algorithm. It is then stored in a temporary XML file
with a specific format. All types of multimedia data are
stored according to their types; however, images are
converted into bitmap format for fast processing and
retrieval. Fig. 7 depicts an example of the XML file.



Fig. 7: An example of the XML document for four types of data
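Since Fig. 7 is not reproduced here, a plausible layout for such a file can be sketched as follows; the element and attribute names are assumptions, not the tool's actual schema:

```python
import xml.etree.ElementTree as ET

# Build one record per extracted item, tagged with its class.
root = ET.Element("extraction", source="http://example.com/page.html")
for cls, value in [("text", "Welcome to the site"),
                   ("image", "logo.png"),
                   ("audio", "intro.mp3"),
                   ("video", "tour.mp4")]:
    item = ET.SubElement(root, "item", type=cls)
    item.text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Keeping one uniform `item` element per record makes the later mapping to database rows a simple one-to-one walk over the document.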

From the XML format, all possibly valuable data
can be mapped to a permanent multimedia database for
later use. In this case, an Oracle 11g database is used
as the storage. Fig. 8 illustrates the classification
output for the image type with its value and link.
Outputs for the other data types are presented in the
same manner.
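The mapping from XML records to database rows might look like the following sketch, using sqlite3 purely as an in-memory stand-in for Oracle 11g (the table layout is an assumption, not the paper's schema):

```python
import sqlite3

# In-memory database standing in for the Oracle 11g storage.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE media (
    id    INTEGER PRIMARY KEY,
    type  TEXT NOT NULL,
    name  TEXT NOT NULL,
    data  BLOB)""")

# One row per classified item; binary content goes in the BLOB column.
rows = [("image", "logo.png", b"\x89PNG..."),
        ("text",  "headline", b"Welcome")]
conn.executemany("INSERT INTO media (type, name, data) VALUES (?, ?, ?)", rows)

images = conn.execute("SELECT name FROM media WHERE type = 'image'").fetchall()
print(images)  # [('logo.png',)]
```

Querying by the `type` column mirrors the per-class retrieval shown in Fig. 8.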





Fig. 8: Classification Output for Image

The user interface of this prototype helps users to
work conveniently with any web page. The menus and
command buttons allow easy access to the unstructured
webpage and the multimedia database. Thus, this
prototype system can be viewed as a tool for extracting
and gathering multimedia data from unstructured
information for systematic data management.
V. CONCLUSION
This paper presents a prototype tool that extracts
data from any web page and stores the necessary
multimedia data in a multimedia database using XML.
The transformation from unstructured information into
structured data has been successfully performed using
various methods, including regular expressions and the
DOM parse tree. The prototype helps end users to store
useful multimedia data (text, image, audio and video)
for future retrieval and use. This research also performs
a comparative analysis of eight extraction tools, namely
DeLa, EXALG, RAPIER, RoadRunner, Stalker, SRV,
WebOQL and WHISK, to ensure that the best data
extraction methods can be adapted for the
implementation of the prototype tool. The tools are
compared based on page type, class of tool, feature
extraction, extraction rule type and learning algorithm.
A further contribution of this research is the
introduction of automated unstructured data capture for
structured storage of multimedia data. The prototype
tool could be further enhanced by displaying the data
stored in the multimedia database in manageable forms
such as reports, documentation and statistics.
REFERENCES
[1] C. W. Smullen, S.R. Tarapore and S. Gurumurthi, “A
Benchmark Suite for Unstructured Data Processing”,
International Workshop on Storage Network Architecture and
Parallel I/Os, Sept. 2007, pp. 79 – 83, 2007.
[2] Merrill Lynch & Co., Inc., http://www.ml.com/
index.asp?id=7695_1512, 2010.
[3] R. Blumberg and S. Atre, “The Problem with Unstructured
Data”. DM Review. Retrieved from
http://www.soquelgroup.com/Articles/dmreview_0203_proble
m.pdf/, 2003.
[4] T. Berners-Lee, J. Hendler and O. Lassila, "The Semantic
Web", Scientific American, May 2001, 284(5):34-43. 2001.
[5] G. Alonso, F. Casati, H. Kuno and V. Machiraju, “Web
Services Concepts, Architectures and Applications”, Springer-
Verlag, 2004.
[6] C.H.Chang, H.Siek, J.J. Lu, C.N. Hsu and J.J. Chiou,
“Reconfigurable web wrapper agents”, IEEE Intelligent
Systems, Vol. 18, Issue 5, Sept 2003, pp: 34 – 40, 2003.
[7] evText, Inc., https://www.evtext.com, 2008.
[8] Fiumara G. , "Automated Information Extraction from Web
Sources: a Survey between Ontologies and Folksonomies"
Workshop in 3rd International Conference on Communities and
Technology, 2007.
[9] C. Hsu and M. Dung, “Generating Finite-State Transducers for
Semistructured Data Extraction from the Web”. J. Information
Systems, 23(8), 1998.
[10] I. Muslea, S. Minton, and Knoblock, “A Hierarchical Approach
To Wrapper Induction”. Proceedings of the third International
Conference on Autonomous Agents (AA-99), 1999 .
[11] A. H. Laender, R. Neto, and D. Silva, “DEByE –Data
Extraction by Example”. Data and Knowledge Engineering,
40(2): 121-154, 2002.
[12] K. Teknomo, ”K-Nearest Neighbors Tutorial”,
http://people.revoledu.com/kardi/tutorial/KNN, 2004.
[13] W. Ding, Songnian Yu, Qianfeng Wang, Jiaqi Yu and Qiang
Guo, “A Novel Naive Bayesian Text Classifier”. International
Symposiums on Information Processing. 2008.
[14] R. Zhang and Z. Zhang, “Image Database Classification based
on Concept Vector Model”. IEEE International Conference on
Multimedia and Expo, 2005.
[15] L. Breiman, J. H. Friedman, R. A. Olshen and C.J. Stone,
“Classification and Regression Trees”, Wadsworth, Belmont,
1984.
[16] S. Brin and L. Page, “The Anatomy of a Large-Scale
Hypertextual Web Search Engine”. In Proceedings of the 7th
International World Wide Web Conference (Brisbane, Australia,
Apr. 14–18), pp. 107–117, 1998.
[17] V.Crescenzi, G.Mecca and P.Merialdo, “RoadRunner: Towards
Automatic Data Extraction from Large Web Sites”. VLDB
Conference, 2001.
[18] R.Baumgartner, S.Flesca and G.Gottlob, “Visual Web
Information Extraction with Lixto”, Proceedings of the 27th
VLDB Conference, 2001.
[19] M. E. Califf, “Relational Learning Techniques for Natural
Language Information Extraction”. Ph.D. thesis, Department of
Computer Sciences, University of Texas, Austin,TX. Also
appears as Artificial Intelligence Laboratory Technical Report
AI, pp. 98-276, 1998.
[20] D. Freitag, “Information Extraction From HTML: Application
Of A General Learning Approach”. Proceedings of the
Fifteenth Conference on Artificial Intelligence (AAAI-98).
1998.
[21] S. Soderland, “Learning Information Extraction Rules
For Semi-Structured And Free Text”. Journal of Machine
Learning, 34(1-3), pp. 233-272, 1999.
[22] B. Adelberg, “NoDoSE: A Tool For Semi-Automatically
Extracting Structured And Semi-Structured Data From Text
Documents”. SIGMOD Record 27(2), pp. 283-294, 1998.
[23] T. Chartrand, “Ontology-Based Extraction Of Rdf Data From
The World Wide Web”. Brigham Young University. 2003.
[24] J. Wang, “Information Discovery, Extraction and Integration
for the Hidden Web”. University of Science and Technology.
2004.
[25] A. Arasu and H. Garcia-Molina, “Extracting Structured
Data from Web Pages”. Proceedings of the ACM SIGMOD
International Conference on Management of Data, San Diego,
California, pp. 337-348, 2003.
[26] G. Arocena and A. Mendelzon. "WebOQL: Restructuring
Documents, Databases, and Webs" in Proceedings of the 14th
International Conference on Data Engineering, Orlando,
Florida, pp. 24-33, 1998.
