
Ontology-Based Automatic Classification for the Web Pages: Design, Implementation and Evaluation
Rudy Prabowo, Mike Jackson, Peter Burden
School of Computing and IT
University of Wolverhampton
35-49 Lichfield Street, Wolverhampton
WV1 1EL, United Kingdom
{Rudy.Prabowo,mj,jphb}@wlv.ac.uk

Heinz-Dieter Knoell
University of Applied Sciences
Fachhochschule Nordostniedersachsen
Volgershall 1
21339 Lueneburg, Germany
hd@knoell.org

Abstract

In recent years, we have witnessed continual growth in the use of ontologies to provide a mechanism that enables machine reasoning. This paper describes an automatic classifier which focuses on the use of ontologies for classifying Web pages with respect to the Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) schemes. Firstly, we explain how these ontologies can be built in a modular fashion and mapped into DDC and LCC. Secondly, we propose formal definitions of a DDC-LCC mapping and an ontology-classification-scheme mapping. Thirdly, we explain the way the classifier uses these ontologies to assist classification. Finally, we present an experiment in which the accuracy of the classifier was evaluated. The experiment shows that our approach results in improved classification accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used.

1. Introduction

In order to organise Web pages and to help Web users retrieve only information relevant to their query, manual classification is carried out by some search engines, e.g. Yahoo! [26]. Due to the increasing number of Web pages, it is impossible to manually classify the entire Web without some form of automated aid [14]. For this reason, automatic document classification has become an important research area.

This paper describes a strategy used to enhance the accuracy of an automatic classifier, called the Automatic Classification Engine (ACE) [2]. The enhancement focuses on the use of ontologies. Related work which focuses on the use of ontologies is discussed in section 2. We then propose a method for building a set of ontologies with respect to DDC [20] and LCC [11], and propose formal definitions for a DDC-LCC mapping and an ontology-classification-scheme mapping in section 3. The way the modified ACE works and makes use of these ontologies is described in section 4. Section 5 presents an experiment which was conducted to evaluate the classification accuracy. Finally, conclusions are drawn and future work is explained in section 6.

This paper uses terms which are defined as follows. The term "ontology" is defined as a single entity which holds conceptual instances of a domain, and differentiates itself from another. In other words, conceptual instances represent the existence of their associated ontology. A conceptual instance can be a concept, a terminology or a gestalt instance. The term "concept" is defined as something that represents the idea of a thing, and can be differentiated from another by defining its properties [17]. The term "terminology" (in the plural form: "terminological resources") is defined as a structured, organised set of concepts in a particular area of specialist knowledge [10]. The term "gestalt instance" is defined as a term which is used to represent a whole rather than its parts. In addition, the term "class representative" is used to denote a class in a classification scheme hierarchy [2].

2. Related Works

Our work is related to two different research areas, i.e. ontology-based applications and automatic document classification.

2.1 Ontology-Based Applications

This section describes two related projects which are based on the use of ontologies to extract and access information within the Web.

The first is a research project which has developed an information extraction system, called WEB->KB, in order to construct knowledge bases from the World Wide Web (WWW) [12]. The second is a project which sets out to provide and access information at a Web portal, called SEAL (SEmantic portAL) [1]. There is a symbiotic relationship between WEB->KB and SEAL. WEB->KB provides a means to construct knowledge bases which can be used by SEAL to satisfy a user query. On the other hand, SEAL provides a means to bridge the gap between the knowledge bases built by WEB->KB and a Web user, and to provide information access to the Web user.

2.2 Automatic Document Classification

This section describes work which has been carried out by researchers in the area of automatic document classification, and the key difference between our work and other work.

In their experiment, [15] proposed an approach for implementing automatic classification. The aim of their research was to automatically classify research project descriptions into a manually pre-defined set of subject headings. They applied three techniques sequentially, i.e. natural language processing, multinomial discriminant analysis [13], and an expert system technique, in order to achieve a high degree of classification accuracy.

To carry out an automatic classification of 173,255 Wall Street Journal documents, [6] conducted an experiment which used natural language techniques to carry out morphological, syntactical and semantic analysis of texts. They analysed and

Proceedings of the 3rd International Conference on Web Information Systems Engineering (WISE’02)
0-7695-1766-8/02 $17.00 © 2002 IEEE
Authorized licensed use limited to: IEEE Xplore. Downloaded on February 6, 2009 at 10:05 from IEEE Xplore. Restrictions apply.
compared the texts found within the document samples against a machine-readable dictionary, called Longman's Dictionary of Contemporary English.

Our work is different from other work in that we focus on (1) the use of domain ontologies, rather than a dictionary or thesaurus, to assist classification; (2) building ontologies based on the text information within the classification schemes and Web pages; (3) the mapping between a set of ontologies and classification schemes. We also restrict our attention to domain ontologies, especially those which can be mapped into any existing classification scheme. The reason behind this restriction is that domain ontologies use a set of terminological resources which can be used to capture the knowledge that is valid for a particular type of domain [5]. Hence, one can use them to reduce ambiguity and maintain the modularity of the domain ontologies.

The advantages of an ontology-based classification approach over the existing ones, such as the hierarchical [24] and probabilistic [18] approaches, are that (1) the relational structure of an ontology provides a mechanism to enable machine reasoning; (2) the conceptual instances within an ontology are not only a bag of keywords, but have inherent semantics and a close relationship with the class representatives of the classification schemes. Hence, they can be mapped to each other (see section 3); (3) this kind of mapping provides a new way to measure the similarity between a Web page and a class representative. It also enables us to gain insight into, and observe, the way the classifier assigns a class representative to a Web page by tracking the links between the conceptual instances involved and the associated class representative (see section 4).

3. Building Domain Ontologies

This section describes our experience in building a set of domain ontologies from scratch with respect to DDC and LCC class representatives, and the way these domain ontologies are mapped into the associated DDC and LCC class representatives.

3.1 Building the ACE Database

There are four steps necessary for building the ACE database. The first step is to define the mapping between DDC and LCC class representatives, and to manually build a classification scheme ontology using a web-based representation language, called OIL [8].

There is a difference in the way these two classification [...] organised by both subject and discipline. A thorough discussion about these two classification schemes can be found in [25]. Due to the different strategies and naming conventions used to organise and name the class representatives, mapping these two schemes poses a special challenge. The first and second columns of Table 1 depict a part of the class representatives in the area of social sciences and economics with respect to DDC and the associated LCC. Section 3.2 discusses a proposed solution to this mapping.

| DDC | LCC | Shared Classes (SC) | Ontology |
|---|---|---|---|
| 300 Social Sciences | H 1-99 Social Sciences in general | sc-social-sciences | social sciences |
| 330 Economics | H-HB 1-3840 Economic theory. Demography | sc-economics, sc-economic-systems, sc-economic-theory, sc-demography | economics |
| 330 Economics | H-HC 10-1085 Economic history and conditions | sc-economic-history, sc-economic-conditions | economics |
| 331 Labor Economics | H-HD 4801-8943 Labor | sc-labor-economics | economics |
| 335 Socialism & related systems | H-HX 1-970.7 Socialism. Communism. Anarchism | sc-socialism, sc-communism, sc-anarchism, sc-facism | political beliefs |

Table 1 A part of the class representatives in the area of social sciences and economics with respect to DDC and LCC.

The second step is to define an ontology which contains a set of classes, called "shared classes". The key idea in using a shared class is to define a common class representative which can represent a part of a class representative of a given classification scheme, and give an "anchor" for the associated conceptual instances. Table 1 (third column) depicts the shared classes. The prefix "sc" is used to differentiate the shared class from the associated conceptual instance when it has the same marker. For example, "sc-socialism" represents a part of these two class representatives: "DDC 335 Socialism and related systems" and "LCC H-HX Socialism Communism and Anarchism", and links the class representatives with the associated conceptual instance, "socialism".

The third step is to manually build domain ontologies which are related to the class representatives. As a starting point, DDC and LCC vocabularies are used for defining the markers of conceptual instances within an ontology. The way these conceptual instances are organised, however, does not depend on the way a classification scheme organises its class representatives. The fact that "DDC 330 Economics" is a subclass of "DDC 300 Social Sciences" does not imply that an ontology about "social sciences" should contain a conceptual instance called "economics". Instead, two ontologies are created: "social sciences" and "economics". Hence, we can maintain the modularity of these ontologies. This is particularly important if we intend to refine these ontologies in more detail and map them into other classification schemes. In contrast, having too specific an ontology might degrade the completeness of the ontology in representing the conceptualisation of a domain. Another aspect of building an ontology is to define a gestalt instance that can cover specific ones. For example, one might want to know to which concept "socialism" belongs. In this context, a new gestalt instance is defined, i.e. "political beliefs". In other words, the abstraction

in this third step is initiated by the way a classification scheme is organised, and further completed and refined by human judgement. Table 1 (column 4) depicts the related ontologies.

The last step is to use fast classification of terminologies (FaCT) [8] in order to detect redundancy in naming a concept marker, and to validate the "is-a", transitive and other relationships between a set of conceptual instances within the ACE database.

In addition to OIL, RDF is adopted to express the ontologies [19]. Although OIL does not need RDF to define ontologies, the merging of OIL and RDF establishes a better foundation for expressing ontologies. From the RDF point of view, OIL enriches the expressive power of RDF in representing ontologies. From the OIL point of view, the RDF scheme [3] provides a mechanism for expressing ontologies that can be understood by many Web participants. A thorough discussion of the merging of OIL and RDF/RDFS can be found in [21].

To maintain the scalability of the domain ontologies with respect to DDC and LCC, the number of conceptual instances within a domain ontology is restricted to between 100 and 200. The number of domain ontologies related to a set of DDC class representatives in the second level ranges from 20 to 25. Since DDC has 10 main first-level class representatives, the number of conceptual instances within the ACE database with respect to DDC is expected to be between 20,000 (10 x 20 x 100) and 50,000 (10 x 25 x 200). The maximum size of a domain ontology is chosen so that it enables easy maintenance, storage and dynamic access.

3.2 Formal Definitions and Issues of Mapping

This section proposes formal definitions of mapping between two different classification schemes, and between a classification scheme and an ontology.

3.2.1 Formal Definition of a Classification Scheme

Let $CR_{X,A}$ be a class representative, $X$, which belongs to the classification scheme, $A$, and is a direct superclass of a set of subclass representatives, $SCR_{X,A}$.

$$SCR_{X,A} = \bigcup_{i=1}^{n} SCR_{X,A,i} \subset CR_{X,A} \qquad (1)$$

Let also $SCR_{X,A,i}$ be a class representative which covers a set of subclass representatives, $\Omega_{X,A,i}$, and let each element of $\Omega_{X,A,i}$ be a direct (or indirect) subclass of $SCR_{X,A,i}$.

$$\Omega_{X,A,i} = \bigcup_{k=1}^{u} \Omega_{X,A,i,k} \subset SCR_{X,A,i} \qquad (2)$$

To guarantee the consistency of the subsumption condition of a set of class representatives, a transitive condition is defined as follows:

$$\Omega_{X,A} \subset CR_{X,A} \Leftrightarrow (\Omega_{X,A} \subset SCR_{X,A}) \wedge (SCR_{X,A} \subset CR_{X,A}) \qquad (3)$$

Analogous to (1), (2) and (3), let $CR_{Y,B}$ be a class representative, $Y$, which belongs to a classification scheme, $B$, and is a direct superclass of a set of subclass representatives, $SCR_{Y,B}$, which cover a set of subclass representatives, $\Omega_{Y,B}$, and let each element of $\Omega_{Y,B,j}$ be a direct (or indirect) subclass of $SCR_{Y,B,j}$, where $j \in \{1,\dots,m\}$.

$$SCR_{Y,B} = \bigcup_{j=1}^{m} SCR_{Y,B,j} \subset CR_{Y,B} \qquad (4)$$

$$\Omega_{Y,B,j} = \bigcup_{l=1}^{v} \Omega_{Y,B,j,l} \subset SCR_{Y,B,j} \qquad (5)$$

$$\Omega_{Y,B} \subset CR_{Y,B} \Leftrightarrow (\Omega_{Y,B} \subset SCR_{Y,B}) \wedge (SCR_{Y,B} \subset CR_{Y,B}) \qquad (6)$$

Let us suppose $CR_{X,A}$ represents a set of class representatives which should be mapped into $CR_{Y,B}$. The next two sections describe the formal definitions of a full and a partial mapping of $CR_{X,A}$ into $CR_{Y,B}$. "$\equiv$" is used as a notation for "full mapping", and "$\cong$" for "partial mapping". "$A \rightarrow B$" means "A can be mapped into B".

3.2.2 Formal Definition of Full Mapping

$CR_{X,A}$ can be fully mapped into $CR_{Y,B}$ ($CR_{X,A} \equiv CR_{Y,B}$) if the following three mapping conditions are met.

(I) $CR_{X,A}$ can be mapped into $CR_{Y,B}$.

$$[CR_{X,A} \rightarrow CR_{Y,B}]$$

(II) For all elements of $SCR_{X,A} \subset CR_{X,A}$ and $SCR_{Y,B} \subset CR_{Y,B}$, there is a one-to-one mapping between elements of $SCR_{X,A}$ and $SCR_{Y,B}$.

$$\forall (SCR_{X,A} \subset CR_{X,A} \wedge SCR_{Y,B} \subset CR_{Y,B}) \; [SCR_{X,A,i} \rightarrow SCR_{Y,B,j}]$$

where $i \in \{1,\dots,n\}$ and $j \in \{1,\dots,m\}$.

(III) For all elements of $\Omega_{X,A,i} \subset SCR_{X,A,i}$ and $\Omega_{Y,B,j} \subset SCR_{Y,B,j}$, there is a one-to-one mapping between elements of $\Omega_{X,A,i}$ and $\Omega_{Y,B,j}$.

$$\forall (\Omega_{X,A,i} \subset SCR_{X,A,i} \wedge \Omega_{Y,B,j} \subset SCR_{Y,B,j}) \; [\Omega_{X,A,i,k} \rightarrow \Omega_{Y,B,j,l}]$$

where $k \in \{1,\dots,u\}$ and $l \in \{1,\dots,v\}$.

3.2.3 Formal Definition of Partial Mapping

$CR_{X,A}$ can be partially mapped into $CR_{Y,B}$ ($CR_{X,A} \cong CR_{Y,B}$) if the following condition is met. For all elements of $SCR_{X,A}$ and $\Omega_{X,A}$, there is at least one element of $SCR_{X,A}$ and $\Omega_{X,A}$ which can be mapped into $CR_{Y,B}$, and either $SCR_{Y,B}$ or $\Omega_{Y,B}$.

$$\forall (SCR_{X,A} \subset CR_{X,A} \wedge \Omega_{X,A} \subset SCR_{X,A}) \; [\exists SCR_{X,A,i} \rightarrow CR_{Y,B}] \wedge [(\exists \Omega_{X,A,i,k} \rightarrow SCR_{Y,B,j}) \vee (\exists \Omega_{X,A,i,k} \rightarrow \Omega_{Y,B,j,l})]$$

where $i \in \{1,\dots,n\}$, $j \in \{1,\dots,m\}$, $k \in \{1,\dots,u\}$ and $l \in \{1,\dots,v\}$.

This formal definition, however, does not explicitly exclude the elements of $SCR_{X,A}$ and $\Omega_{X,A}$ which are irrelevant to $CR_{Y,B}$. For this reason, we refine the partial mapping definition based on the use of shared classes (discussed in section 3.1). Based on equations (1)-(2) and (4)-(5), $\mathcal{SC}_{X,A}$ and $\mathcal{SC}_{Y,B}$ are defined as follows:

$$\mathcal{SC}_{X,A} = CR_{X,A} \cup \Big[ \bigcup (SCR_{X,A,i}, \Omega_{X,A,i,k}) \Big] \qquad (7)$$

where $i \in \{1,\dots,n\}$ and $k \in \{1,\dots,u\}$.

$$\mathcal{SC}_{Y,B} = CR_{Y,B} \cup \Big[ \bigcup (SCR_{Y,B,j}, \Omega_{Y,B,j,l}) \Big] \qquad (8)$$

where $j \in \{1,\dots,m\}$ and $l \in \{1,\dots,v\}$.

Based on (7) and (8), $\mathcal{SC}_{X,Y}$ is defined as follows:

$$\mathcal{SC}_{X,Y} = \mathcal{SC}_{X,A} \cup \mathcal{SC}_{Y,B} \qquad (9)$$

Equation (9) states that a set of shared classes is composed of a number of class representatives of the two classification schemes, $A$ and $B$. In order to have a reasonable partial mapping, the irrelevant elements of the shared classes have to be excluded.

Let $\mathcal{SC}'_{X,Y}$ be a set of shared classes which are relevant to assist partial mapping. Based on (7) and (8), $\mathcal{SC}'_{X,Y}$ is defined as follows:

$$\mathcal{SC}'_{X,Y} = (\mathcal{SC}_{X,A} \cap \mathcal{SC}_{Y,B}) = \{SC'_1, \dots, SC'_w\} \qquad (10)$$

Hence, the partial mapping definition can be refined as follows: for all elements of $\mathcal{SC}'_{X,Y}$, there is at least one element of $\mathcal{SC}'_{X,Y}$ which represents $SCR_{X,A}$ and $\Omega_{X,A}$, and can be mapped into $CR_{Y,B}$, and either $SCR_{Y,B}$ or $\Omega_{Y,B}$.

$$\forall \mathcal{SC}'_{X,Y} \; [(\exists SC'_p = SCR_{X,A,i}) \rightarrow (CR_{Y,B} \in \mathcal{SC}'_{X,Y})] \wedge [(\exists SC'_p = \Omega_{X,A,i,k}) \rightarrow ((SCR_{Y,B,j} \vee \Omega_{Y,B,j,l}) \in \mathcal{SC}'_{X,Y})]$$

where $p \in \{1,\dots,w\}$, $i \in \{1,\dots,n\}$, $j \in \{1,\dots,m\}$, $k \in \{1,\dots,u\}$ and $l \in \{1,\dots,v\}$. Based on equation (10) and the mapping definition, one can map a number of classification schemes via a set of shared classes without depending on the peculiarities of the structure of the classification schemes involved. To guarantee the consistency of abstraction, i.e. the hierarchical structure of the class representatives involved, two partial mapping conditions are described as follows:

(I) $[(\exists SC'_p = SCR_{X,A,i}) \rightarrow (CR_{Y,B} \in \mathcal{SC}'_{X,Y})] \Leftrightarrow [SC'_p \subseteq CR_{Y,B} \in \mathcal{SC}'_{X,Y}]$

(II) $[(\exists SC'_p = \Omega_{X,A,i,k}) \rightarrow ((SCR_{Y,B,j} \vee \Omega_{Y,B,j,l}) \in \mathcal{SC}'_{X,Y})] \Leftrightarrow [SC'_p \subseteq (SCR_{Y,B,j} \vee \Omega_{Y,B,j,l}) \in \mathcal{SC}'_{X,Y}]$

3.2.4 Two Mapping Issues

To illustrate the first issue, an example of mapping with respect to DDC and LCC is described. Let us use "DDC 306 Culture and Institutions" as an example, and concentrate only on the associated LCC class representatives which are semantically identical to one of the two topics of DDC 306, i.e. "institutions". Institutions in DDC 306 covers political, economic and religious institutions, and all institutions which are pertinent to death and relations of the sexes. The associated LCC class representative for "political institutions" is "J-JF-(20)-2112 Political institutions and public administration". An associated LCC class representative for the other types of institutions does not exist.

Recall that LCC is a classification scheme which is organised by discipline. The word "institutions", however, refers to a broad subject, rather than a specific discipline. An attempt to fully map "institutions" to any related LCC class representatives will result in loss of precision. Therefore, the formal definition of partial mapping is applied.

By applying equations (7), (8) and (9), we obtain the shared classes for "DDC 306 Culture and Institutions" and the associated LCC class representatives.

$\mathcal{SC}_{\text{306-pertaining-to-institutions},DDC}$ = {political institutions, economic institutions, religious institutions, institutions pertaining to death, institutions pertaining to relations of the sexes}.

$\mathcal{SC}_{\text{associated-classes},LCC}$ = {public administration, political institutions}.

$\mathcal{SC} = \mathcal{SC}_{\text{306-pertaining-to-institutions},DDC} \cup \mathcal{SC}_{\text{associated-classes},LCC}$.

By applying equation (10), we exclude all shared classes which are irrelevant to mapping.

$\mathcal{SC}' = (\mathcal{SC}_{\text{306-pertaining-to-institutions},DDC} \cap \mathcal{SC}_{\text{associated-classes},LCC})$ = {political institutions}.

This means that an LCC class representative can be partially mapped into DDC 306 based on the fact that there is one element of $\mathcal{SC}'$, i.e. "political institutions", which is semantically identical to one element of $\mathcal{SC}_{\text{306-pertaining-to-institutions},DDC}$ and $\mathcal{SC}_{\text{associated-classes},LCC}$.

The second issue is concerned with the abstraction of a class representative. For example, "DDC 306.1 Religious institutions" and "LCC B-BL-630-(632.5) Religious organizations" refer to the same meaning, i.e. a group of people who share the same interest or purpose. An attempt to map these two class representatives via a shared class would, however, result in loss of precision due to the different meanings of "an institution" and "an organization". "An institution" is not only an organization, but also has influence in the community. For this reason, a partial mapping is not allowed due to the fact that "an institution" is semantically different from "an organization", unless it is explicitly stated in the classification schemes involved that "institution" is a subclass of "organization".

3.2.5 Ontology-Classification-Scheme Mapping

An ontology is defined as a sign system $O = (\mathcal{L}, \mathcal{F}, \mathcal{G}, \mathbb{C}, \mathcal{H}, \mathcal{R}, \mathcal{A})$. A complete description of the formal definition of an ontology can be found in [1]. Here we present only the part of this definition which is used throughout this paper, i.e.

a set of concepts, $\mathbb{C}$. For each $C \in \mathbb{C}$, there is at least one statement in the ontology, i.e. its embedding in the taxonomy;

a taxonomy, $\mathcal{H}$. Concepts are taxonomically related by the irreflexive, acyclic, transitive relation $\mathcal{H} \subset \mathbb{C} \times \mathbb{C}$. $\mathcal{H}(C_1, C_2)$ means $C_1$ is a subconcept (or subclass) of $C_2$;

a set of binary relations, $\mathcal{R}$.

The formal definition of an ontology-class-representative mapping via a set of shared classes, $\mathcal{SC}'$, is described as follows: an ontology, $O$, can be mapped into a class representative, $CR_i$, if and only if

(I) there is at least one element, $C_i \in \mathbb{C}$, which refers to one element of $\mathcal{SC}' \in CR_i$

$$[(\exists C_i \in \mathbb{C}) \rightarrow (\mathcal{SC}' \in CR_i)]$$

(II) there is at least one element, $C_j \in \mathbb{C}$, where $C_i$ is a subclass of $C_j$, and $\mathcal{H}(C_i, C_j)$ refers to the hierarchical relationship of $CR_i$ and its direct superclass, $CR_j$.

$$[(\exists C_j \in \mathbb{C}) \wedge \mathcal{H}(C_i, C_j) \rightarrow \mathcal{H}(CR_i, CR_j)]$$

4. The Way the ACE Works

This section discusses the implementation of ACE. Section 4.1 describes the dynamic term table used to store terms found within a Web page, and how the domain ontologies and the DDC and LCC schemes are dynamically represented. The way the ACE carries out classification is discussed in section 4.2.

4.1 Initialisation

At start-up time, ACE creates a dynamic table, called the "term table", which consists of four columns, i.e. name, weight, tag and position. It is used to store the terms found within a Web page. The first column, "name", is used to store the name of a term. The second and third columns, "weight" and "tag", are used to determine and store the maximum weight of a term and the tag within which the term occurs. How a weight is obtained is discussed in section 4.2. The last column is used to store the position of a term, which is important for determining a phrase.

Then, ACE validates the syntax of the domain ontologies and generates a set of triples [19] by using the RDF API [23], which contains an RDF parser. Based on these triples, ACE builds a semantic network [22] to represent the domain ontologies.

To represent the relationship between the conceptual instances, shared classes and class representatives, and to carry out the classification process, a feed-forward neural network [22] is built. A feed-forward neural network is a neural network in which each unit is linked only to units in the next layer.
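
The layering just described can be illustrated with a minimal sketch. It is not ACE's implementation: the names `LAYERS`, `LINKS`, `layer_of`, `is_feed_forward` and `class_representatives` are hypothetical, and the units are taken from the "socialism" row of Table 1 purely for illustration. The sketch only checks that every link goes from one layer to the next, and follows the links from a conceptual instance to its class representatives.

```python
from typing import Dict, List, Set

# Layer 0: conceptual instances, layer 1: shared classes,
# layer 2: class representatives (illustrative units from Table 1).
LAYERS: List[Set[str]] = [
    {"socialism"},
    {"sc-socialism"},
    {"DDC 335", "LCC H-HX"},
]

# Hypothetical links between units; each unit points to units it feeds.
LINKS: Dict[str, List[str]] = {
    "socialism": ["sc-socialism"],
    "sc-socialism": ["DDC 335", "LCC H-HX"],
}


def layer_of(unit: str) -> int:
    """Return the index of the layer that contains the given unit."""
    for index, layer in enumerate(LAYERS):
        if unit in layer:
            return index
    raise KeyError(unit)


def is_feed_forward() -> bool:
    """True if every link goes from a layer to the immediately next layer."""
    return all(
        layer_of(target) == layer_of(source) + 1
        for source, targets in LINKS.items()
        for target in targets
    )


def class_representatives(instance: str) -> List[str]:
    """Follow links from a conceptual instance through its shared classes."""
    found: List[str] = []
    for shared_class in LINKS.get(instance, []):
        found.extend(LINKS.get(shared_class, []))
    return sorted(found)
```

Tracking the links in this way is what makes it possible to see why a given class representative was reached from a conceptual instance found in a Web page.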

There are three layers involved in this feed-forward network model, described as follows:

(1) input layer: this layer represents a set of input units, i.e. high level conceptual instances: terminologies or gestalt instances;

(2) hidden layer: this layer represents a set of hidden units, i.e. shared classes. The hidden units are functional units that do not receive direct inputs from the environment. Rather, they serve as an intermediate stage in the analysis of the input;

(3) output layer: this layer represents a set of output units, i.e. DDC and LCC class representatives.

The activation spreads only from the input layer towards the output layer. The Sigmoid function [22],

$$f(x) = \frac{1}{1 + e^{-x}}, \quad \text{where } x > 0 \text{ and } 0 < f(x) < 1$$

is chosen as the activation function. The reason why the Sigmoid function is chosen is given in section 5.2.

The notation used in the feed-forward network is described as follows:

$w_{I,S}$ = a weight on the link between an input unit and a hidden unit.
$w_{S,C}$ = a weight on the link between a hidden unit and an output unit.
$a_I$ = an activation value of an input unit.
$a_S$ = an activation value of a hidden unit.
$E_I$ = an environment contribution of an input unit.

The weights on the links between a parent unit and its children units are contingent on the number of children of the parent unit, and are scaled so that the maximum sum of the activation values of the parent unit is approximately 1 (0.99).

We make the following three assumptions: (1) the strength of a signal, $x_i$, in the input level is the product of the activation value of an input unit, $a_i$, and the weight on the link, $w$, plus a normalised environment contribution, $E$; (2) the activation value of an input unit is 1, which means that this input unit is activated; (3) the maximum sum of the normalised environment contributions (as described in section 4.2, stage 3) is 1. Based on these assumptions, and because $f(x) \le 0.99$ holds exactly when $x \le \ln 99 \approx 4.59511985$, we can determine the maximum sum of weights on the links between input units and a hidden unit:

$$f_{\max}\Big( \sum \{ w_{I,S} \cdot a_I \} + \sum E_I \Big) \le 0.99$$

$$f_{\max}\Big( \sum \{ w_{I,S} \cdot 1 \} + \sum E_I \Big) \le 0.99$$

$$\sum w_{I,S} + \sum E_I \le \ln 99$$

$$\sum w_{I,S} \le 3.59511985$$

The weight on each link involved can then be computed as follows: $w_{I,S_i} = 3.59511985 / n$, where $n$ is the number of input units of a given hidden unit.

We can determine the maximum sum of weights on the links between hidden units and an output unit in the same way. Since the environment contribution has already been taken into account, the weight on each link involved can then be computed as follows: $w_{S,C_i} = 4.59511985 / n$, where $n$ is the number of hidden units of a given output unit.

4.2 Automatic Classification Process

The five sequential stages in the automatic classification process are described below.

Stage 1: Analysing and weighting terms.

The terms found within a Web page are analysed, weighted and stored in the term table based on the tag within which a term is found. A term which occurs in a title, heading or meta-keyword tag is assigned a large weight, $\omega_1$. Quite often, a term which occurs in these three significant tags occurs again in other tags. In this case, ACE increments the weight of this term with a large weight, although it appears in other tags, because it has been considered significant in the first occurrence. Otherwise, a small weight, $\omega_2$, is assigned to the term. This means that if the two tags are different in terms of their significance, ACE chooses the most significant one, and rectifies the weight of the term as follows:

$$w_t = \omega_t \cdot tf$$

$w_t$ = the weight of a term, $t$.
$\omega_t$ = the degree of weight of a term, $t$: $\{\omega_1, \omega_2\}$.
$tf$ = the number of occurrences of a term, $t$.

This strategy facilitates the differentiation of significant terms from insignificant ones. Note that $\omega_1$ and $\omega_2$ are tuning parameters which play a key role in determining the primary topic and secondary topic of the Web page. How these two weights are obtained is discussed in section 5.

Stage 2: Weighting conceptual instances.

Based on the domain ontologies within the ACE database, ACE compares conceptual instances within the Web page with the conceptual instances within the ACE ontologies. Here, "a conceptual instance within a Web page" means a keyword or a phrase descriptor (from within the term table), which may consist of more than one term and refers to a concept. To identify phrase descriptors as a set of concept markers, ACE adopts a phrase recognition method, called the non-syntactic phrase indexing method [9].

When a conceptual instance within the Web page matches a conceptual instance within the ACE database and the weight of the conceptual instance within the Web page is greater than $\omega_1$, ACE determines whether the conceptual instance has a parent reference(s), other than a shared class reference. If yes, then ACE increments the weight of its parent. Otherwise, ACE increments the weight of the conceptual instance. In other words, ACE converges (or accumulates) all weights on the weight of the high level conceptual instance. This strategy is applied in order to identify the "gestalt" (or broader concept) rather than the "parts" (or specific concepts) within a Web page. This convergence is only allowed for "is-a", "cover" and "part-of" relationships.

For other types of relationships, ACE converges (or accumulates) the weight on a high level conceptual instance (or the primary concept) based on the primary concept properties and facet values. Based on the concept properties attached to the concept, and the facets which define the allowed values on the relation between the primary concept and its concept properties, the primary concept can be identified.

For example, the primary concept "communications" has two concept properties: "types of communications" and "communications services". The facets for the concept property "types of communications" are "wireless, computer and postal communications". The facets for the concept property "communications services" are "electronic mail, news and mail". Let us suppose a Web page contains the words "wireless communications" and "electronic mail". Based on the facets, ACE can infer that the Web page implicitly contains two concept properties, "communications services" and "types of communications". This leads ACE to the conclusion that the primary concept of the Web page is "communications".

To avoid over-fitting, the concept weighting strategy regards prepositions as a part of the concept marker, but does not take the weight of prepositions into account, and focuses only on the weight of the nouns involved. The following describes the algorithm for the weighting strategy. To simplify the description of the algorithm, let $\Phi$ be a set of weighted

keywords and phrase descriptors: {µ1, µ2, ..., µn}, each of which is constituted by one or more terms. Let also Ο be a domain ontology which consists of
a set of concepts, ℂ: {c1, c2, ..., cn};
a set of relations, Ř: {ř1, ř2, ..., řn}, which express "is-a", "part-of", and "cover" relations;
a set of relations, Ŕ: {ŕ1, ŕ2, ..., ŕn}, which express other types of relations.

procedure converge_weights(Φ)
{
    for each µi ∈ Φ
    {
        for each cj ∈ ℂ
        {
            if (µi=cj ∧ cj.relation=ř ∈ Ř ∧ cj.shared_class=nil) {
                converge_weight_on_the_associated_superclass(cj, µi);
            } elseif (µi=cj ∧ cj.relation=ŕ ∈ Ŕ ∧ cj.shared_class=nil) {
                converge_weight_on_the_associated_primary_concept(cj, µi);
                search_significant_associated_facet_values(primary_concept);
            } elseif (µi=cj ∧ cj.shared_class=SCk ∈ SC) {
                store(cj, µi);
            }
        }
    }
}

The procedure "converge_weights" consists of three parts, which are described as follows: (1) the procedure "converge_weight_on_the_associated_superclass" is called recursively until a high level conceptual instance which is related to a shared class is found; (2) the procedure "converge_weight_on_the_associated_primary_concept" searches for and converges weights on a primary conceptual instance, and passes it to the procedure "search_significant_associated_facet_values" in order to identify whether the significant associated facet values of the primary conceptual instance are found; (3) in the case where a high level conceptual instance is found and linked to one or more shared classes, the procedure "converge_weights" only needs to call the procedure "store" in order to weight and store the conceptual instance.

Stage 3: Weight normalisation.
Before ACE searches for the associated class representative(s), ACE normalises the weights of the significant conceptual instances only. To capture the intuitive semantic value of the significant conceptual instances, the weights are scaled so that the Euclidean norm of the weight vector is 1 [16]. The Euclidean norm is defined as follows:

$$\lVert w \rVert_2 = \sqrt{\sum_{i=1}^{n} w_i^2}$$

where w_i is the weight of the i-th significant conceptual instance and n is the number of significant conceptual instances. A conceptual instance which has a weight greater than the predefined threshold value, x = 0.5, is considered to be significant, and will be taken into the next stage.

Stage 4: Assigning class representative(s).
Using the feed-forward network model, ACE searches for the associated class representative(s). Based on the assumption that the normalised weights of the significant conceptual instances are evenly distributed, the activation values would be continuous with an upper bound. For this reason, the Sigmoid function, f(x) = 1 / (1 + e^(-x)), is chosen as the activation function. The activation of an upper level unit (a shared class or a class representative) is possible if the strength of the signal, x, is greater than the pre-defined threshold value, t = 0.5.

Figure 1 A feed-forward network model. (Layout: the class representative CR1 is fed by the shared classes SC1 and SC2 through the weights wSC1,CR1 and wSC2,CR1; SC1 is fed by the conceptual instances CI1 and CI2 through wCI1,SC1 and wCI2,SC1, and SC2 by CI3 through wCI3,SC2; each conceptual instance CI1, CI2, CI3 receives an environment input EI1, EI2, EI3.)

The feed-forward network is also used to assist the classifier to measure the similarity between the Web page and a class representative. The classifier regards the activation value of an output unit (a class representative) as the similarity coefficient, which depends on the environment contributions, i.e. the normalised weights of the conceptual instances (described in stage 3), and the number of children, i.e. high level conceptual instances or shared classes, given a parent unit, i.e. a shared class or a class representative. This is the main motivation for employing the feed-forward network. The following three examples illustrate the way the classifier measures the similarity coefficient of a Web page. Figure 1 is used as the feed-forward network model for these three examples.

Example 1: a Web page only contains a concept marker which refers to the conceptual instance, CI1. This means that the normalised weight of the conceptual instance is 1. The activation value of the class representative, CR1, is computed as follows:

$$f(a_{CR_1}) = f(w_{SC_1,CR_1} \cdot a_{SC_1}) = f(w_{SC_1,CR_1} \cdot f(w_{CI_1,SC_1} \cdot a_{I_1} + E_{I_1})) = f(2.2976 \cdot f(1.7976 \cdot 1 + 1)) = 0.8971$$

Example 2: a Web page contains two concept markers which refer to the conceptual instances, CI1 and CI3, and each of them is assigned a normalised weight, 0.7071. The activation value of the class representative, CR1, is computed as follows:

$$f(a_{CR_1}) = f(w_{SC_1,CR_1} \cdot a_{SC_1} + w_{SC_2,CR_1} \cdot a_{SC_2}) = f(w_{SC_1,CR_1} \cdot f(w_{CI_1,SC_1} \cdot a_{I_1} + E_{I_1}) + w_{SC_2,CR_1} \cdot f(w_{CI_3,SC_2} \cdot a_{I_3} + E_{I_3})) = f(2.2976 \cdot f(1.7976 \cdot 1 + 0.7071) + 2.2976 \cdot f(3.5951 \cdot 1 + 0.7071)) = 0.9878$$

Example 3: a Web page contains three concept markers which refer to the conceptual instances, CI1, CI2 and CI3, and each of them is assigned a normalised weight, 0.5774. The activation value of the class representative, CR1, is computed as

follows:

$$f(a_{CR_1}) = f(w_{SC_1,CR_1} \cdot a_{SC_1} + w_{SC_2,CR_1} \cdot a_{SC_2}) = f\big(w_{SC_1,CR_1} \cdot f((w_{CI_1,SC_1} \cdot a_{I_1} + E_{I_1}) + (w_{CI_2,SC_1} \cdot a_{I_2} + E_{I_2})) + w_{SC_2,CR_1} \cdot f(w_{CI_3,SC_2} \cdot a_{I_3} + E_{I_3})\big) = f(2.2976 \cdot f((1.7976 \cdot 1 + 0.5774) + (1.7976 \cdot 1 + 0.5774)) + 2.2976 \cdot f(3.5951 \cdot 1 + 0.5774)) = 0.9894$$

Example 3 shows that a Web page which contains all conceptual instances yields the largest activation value, 0.9894. In contrast, a Web page which only contains one conceptual instance yields the smallest activation value, 0.8971. In addition, example 2 shows that although the Web page does not contain all conceptual instances, it still yields a significant activation value, 0.9878. This is justified by the fact that the Web page conceptually represents the two shared classes which refer to the class representative, CR1. In other words, not only the weights of the conceptual instances play a role in the activation value, but also the shared classes which represent all the topics covered by the class representative.

Stage 5: Generating and storing metadata.
The last step in the classification process is to generate metadata for the Web page. The metadata contains the significant, expanded and collateral concept marker(s), and the assigned DDC and LCC class representatives. Significant concept markers refer to the conceptual instances which are considered to be significant in stage 3. Based on the semantic links between the conceptual instances, ACE can look for and store the superclass of the significant conceptual instances in the metadata as expanded information. The main purpose of this expanded information is to attach new relevant concepts to a Web page. Collateral concept markers refer to the conceptual instances which are considered to be insignificant in stage 3. They are also stored in the metadata based on the assumption that they may be useful for a Web user seeking specific information which is closely related to the significant concept markers. The metadata is stored using an XML database management system called Xindice [4].

5. Experimental Evaluation

This section describes an experiment to evaluate the effectiveness of the classifier in terms of its coverage and accuracy. Coverage ratio is measured as the ratio of the number of classified Web pages to the total number of Web pages in a collection. Accuracy ratio is measured as the ratio of the number of correctly classified Web pages to the total number of classified Web pages.

5.1 Test Samples and Domain Ontologies used

For test samples, we manually collected Web pages from within Google – Web – directories [7], i.e. 202 pre-classified Web pages which semantically refer to the DDC 300 in the third level (as the first test set), and 200 pre-classified Web pages which semantically refer to the DDC 500 in the third level (as the second test set). In order to manually select the samples, we traversed down through the Google path directories until we found the directories which semantically refer to the DDC class representatives. Then, we took from each of those directories pre-classified Web pages as our samples. The samples were chosen so that each of them represented a DDC class representative in the third or fourth level. In order to measure the accuracy of classification results against a random collection, ACE automatically classified 303,000 Web pages and only achieved a 10% coverage ratio due to the incompleteness of the domain ontologies used; of these, 207 Web pages classified to DDC 300 were randomly selected as the third test set. As these 207 Web pages were not pre-classified by Google, we manually classified these Web pages. To avoid bias we submitted queries to Google based on the title texts of these 207 Web pages. Google was used to generate the categories for each of these Web pages when these were known to Google.

The domain ontologies which contain 402 conceptual instances related to the DDC "300 Social Sciences" were chosen. The reason for this choice was that DDC 300 covers many different subject areas. DDC 300 contains 92 class representatives in the third level; of these, only 59 class representatives were mapped to the related domain ontologies. In other words, the domain ontologies used covered 64% of DDC 300. In addition, the domain ontologies that contain 398 conceptual instances related to the DDC "500 Natural Sciences" were chosen as a counter example. DDC 500 contains 95 class representatives in the third level; of these, only 61 class representatives were mapped to the related domain ontologies. In other words, the domain ontologies used covered 64% of DDC 500.

The reason domain ontologies for DDC 500 were chosen as a counter example was that they are almost equal in size to the domain ontologies for DDC 300 (398 versus 402 conceptual instances). Hence, domain ontologies for DDC 300 and 500 are suitable for determining a correlation between the completeness of domain ontologies and classification accuracy. The term "completeness" refers to the extent to which an ontology can be said to be exhaustive for representing a domain in terms of the representativeness of its conceptual instances and the relationships among them. To precisely quantify the completeness of an ontology given a domain, however, is almost impossible. With respect to this assumption, we argue that there was a difference between the two distinct ontologies in terms of their completeness. From this point of view, this experiment was conducted to see whether this difference could affect classification accuracy (section 5.3).

5.2 Classification Coverage and Accuracy with respect to the Weight, ω1

To determine the ratio of the small weight, ω2, to the large weight, ω1, we used and labelled the 609 test samples. ω2 was set to 1 throughout this experiment. As the starting point, we set ω1 to 1, and assumed that terms found within the three HTML tags, i.e. title, heading and meta-keyword tag, were the candidates which might be entitled to the large weight, ω1. For each loop, ACE subsequently classified the samples, compared the classification results with the predefined ones and measured the coverage and accuracy ratio. In order to know which terms were entitled to the large weight, ω1, ACE classified the test samples in three different modes. In the "T" mode, ACE only assigned a large weight, ω1, to the terms found within a title tag; in the "T+H" mode, within a title and heading tag; in the "T+H+K" mode, within a title, heading and meta-keyword tag. The loop was repeated until ω1 = 100.

Figure 2a and 2b show that the number of correctly classified Web pages increased from 441 to 471 (accuracy ratio +13.77 percentage points) for ω1 = {1...6}. This gain, however, came at the cost of a drop in classification coverage from 582 to 526 Web pages (coverage ratio -9.20 percentage points). For ω1 = {6...16}, the

classifier excluded all the significant terms found within a heading or meta-keyword tag. This degraded both the classification coverage, from 526 to 508 Web pages (-2.95 percentage points), and the number of correctly classified Web pages, from 471 to 460, although the accuracy percentage increased by 1.01 percentage points. For ω1 = {16...100}, the coverage and accuracy remained constant.

Figure 2a Coverage-Accuracy curve with respect to the terms – found within the title tags – which were assigned with the large weight, ω1 (x-axis: weight, ω1).

Figure 2b Coverage-Accuracy Percentage curve with respect to the terms – found within the title tags – which were assigned with the large weight, ω1 (x-axis: weight, ω1).

Analogously, figure 3a and 3b show that the number of correctly classified Web pages increased from 441 to 464 (accuracy ratio +10.96 percentage points) for ω1 = {1...7}. This gain, however, came at the cost of a drop in classification coverage from 582 to 535 Web pages (-7.72 percentage points). For ω1 = {7...16}, the classifier excluded all the significant terms found within a meta-keyword tag. This degraded the classification coverage from 535 to 527 Web pages (-1.31 percentage points), and the number of correctly classified Web pages from 464 to 458, although the accuracy percentage increased by 0.18 percentage points. For ω1 = {16...100}, the coverage and accuracy remained constant. Figure 3b also shows that by assigning terms found within the title and heading tags with ω1 = {16...100}, the classification coverage and accuracy percentage remained steady at ≈86%.

Figure 3a Coverage-Accuracy curve with respect to the terms – found within the title and heading tags – which were assigned with the large weight, ω1 (x-axis: weight, ω1).

Figure 3b Coverage-Accuracy Percentage curve with respect to the terms – found within the title and heading tags – which were assigned with the large weight, ω1 (x-axis: weight, ω1).

In contrast, figure 2b shows that there is a difference between the coverage and accuracy percentages. The coverage percentage remained steady at 83.42%, whilst the accuracy percentage remained steady at 90.55%. This suggests that taking terms found within a heading tag into account as significant concept markers can deteriorate classification accuracy. On the other hand, the coverage degradation is not as large as that shown in figure 2a/2b.

Assigning terms found within a meta-keyword tag with the large weight, ω1, does not result in a better level of accuracy. In fact, it causes the classification coverage and accuracy to deteriorate. This suggests that terms found within a meta-keyword tag should not be assigned with a large weight, ω1.

ω1 can be categorised into three intervals. The first weight interval consists of a set of weights which result in a better level of accuracy. The second one consists of a set of weights which degrade classification accuracy. The last one consists of a set of

weights which do not affect classification coverage and accuracy. To analyse the characteristics of these three intervals in more detail, ACE classified the 609 samples separately, i.e. 202 Web pages related to DDC 300, 200 Web pages related to DDC 500, and 207 Web pages from the random collection. Despite some fluctuations, the coverage-accuracy curves for these three test sets yielded the same type of curves as shown in figure 2a and 3a. The length of each interval, however, varied and depended on the test set and the mode used.

5.3 Correlation between the Completeness of Domain Ontologies and Classification Accuracy

This section presents an experiment which was conducted in order to see whether the completeness of the ontologies used could make a difference in terms of classification accuracy. We used, for each test set, the ratio of ω2 to ω1 which yielded the best performance in terms of accuracy, and only assigned ω1 to the terms found within a title tag in order to avoid bias, i.e. other factors that can degrade classification accuracy, such as terms within a meta-keyword tag. The experiment tested H0: there is no difference between classification results for DDC 300 and 500 in terms of classification accuracy (null hypothesis) against H1: there is a difference between classification results for DDC 300 and 500 in terms of classification accuracy (alternative hypothesis). ACE automatically classified the pre-classified Web pages from within the Google – Web – directories. If ACE was able to classify the Web page as Google did, then the accuracy ratio was incremented. Table 2 below depicts the observed frequencies (O).

Observed Frequencies (O)                          Correctly    Wrongly      Not
                                                  Classified   Classified   Classified
202 pre-classified Web pages related to DDC 300   166          9            27
200 pre-classified Web pages related to DDC 500   173          11           16
Table 2 Observed Frequencies (O).

A chi-square, χ², analysis was carried out. The test statistic, χ², was 3.1486, which is smaller than the critical value χ²(0.05) = 5.99 for two degrees of freedom. The conclusion is that there is insufficient evidence to reject H0 (null hypothesis). This indicates that differences in the completeness of the domain ontologies used do not measurably affect the accuracy of the classification.

In order to detect whether there is a difference between classification accuracy on pre-classified and random pages with respect to DDC 300, we conducted another experiment. The experiment tested H0: there is no difference between classification accuracy on pre-classified and random pages with respect to DDC 300 against H1: there is a difference between classification accuracy on pre-classified and random pages with respect to DDC 300. Table 3 below depicts the observed frequencies (O).

Observed Frequencies (O)                          Correctly    Wrongly      Not
                                                  Classified   Classified   Classified
202 pre-classified Web pages related to DDC 300   166          9            27
207 Web pages classified to DDC 300               137          9            61
(from random collection)
Table 3 Observed Frequencies (O).

A chi-square, χ², analysis was carried out. The test statistic, χ², was 15.8532, which is greater than the critical value χ²(0.01) = 9.21 for two degrees of freedom. The conclusion is that there is strong evidence to reject H0 (null hypothesis). This means that the level of classification accuracy on random pages had deteriorated significantly. The reason for this deterioration is that the pre-classified, manually selected Web pages from within Google Web directories contain conceptual instances that fully match the domain ontology used. In contrast, the random pages contain conceptual instances that do not fully match the domain ontology used. For this reason, the classification accuracy on random pages deteriorated.

In addition to the statistical test analysis described above, we found two factors that degrade the classification accuracy:
(1) incompleteness of the conceptual instances within an ontology. The percentage was 99.25% (= 132 of 133 wrongly and not classified samples). This had been anticipated from the beginning of the project. We decided to start with small domain ontologies. These ontologies mostly contain high level concepts and some low level (or specific) concepts. The high level concepts used are closely related to (or the same as) the DDC and LCC class representatives. It is not surprising that ACE often fails to identify a specific concept due to the incompleteness of the ontology used;
(2) inability of ACE to reason. The percentage was 0.75% (= 1 of 133 wrongly and not classified samples). This is because ACE does not think and reason as a human classifier does. For example, when ACE analysed and classified a Web page about slavery, ACE found a title text, "Beyond Face Value". The concept of slavery itself is contained in an image file. A human classifier can easily classify this Web page since they can see information which is encoded in the image file, and can arrive at a conclusion based on this image information and the title text.

6. Conclusions and Future Works

There are three issues in the ontology-based automatic classification with respect to DDC and LCC. The first issue is to map the LCC class representatives into the DDC class representatives. Due to the different strategies and naming conventions used to organise and name the class representatives, mapping these two schemes poses a special challenge. This paper proposes the formal definitions for full and partial mapping, and the use of shared classes to partially map these two classification schemes without loss of precision. The main idea of the use of shared classes is to define a set of common class representatives which conceptually refer to the related class representatives, and to link the class representatives with their associated domain ontologies.

The second issue is how to correctly identify the main topic of a Web page based on the terms within the Web page. To tackle this issue, a term weighting strategy is applied in order to clearly differentiate the significant terms from the insignificant ones. Subsequently, a concept weighting strategy is applied to identify the significant conceptual instances, which represent the primary concept of the Web page, based on the weighted terms and the structure of the domain ontologies. This strategy is based on the assumption that the significant conceptual instances are dependent on each other, represent the main topic of the Web page, and conceptually refer to the associated class representatives.

The third issue is how to map the significant conceptual instances into their associated class representative(s), and to

represent this relationship. To tackle this issue, we propose the formal definitions for ontology-classification-scheme mapping. Then, we use a feed-forward network model to dynamically represent the semantic links between a set of conceptual instances and their associated class representative(s). The model allows us to observe the way the classifier assigns a class representative to a Web page by tracking the links between the conceptual instances involved and the associated class representative.

The experiment results show a statistical improvement in terms of accuracy when compared to the previous classifier described in [2]. This means that the strategy applied for classification based on the use of the domain ontologies does result in a better level of accuracy. This improvement, however, comes at the cost of a low coverage ratio due to the incompleteness of the ontologies used (table 3, 3rd row). For this reason, we propose the adoption of natural language processing techniques which can be used to automatically identify new concept markers, and machine learning techniques which can be used to automatically integrate these new concept markers into the existing ones, in order to complete the existing domain ontologies. Hence, a better level of accuracy can be achieved without degrading the coverage ratio.

Acknowledgement
The authors thank Peter Musgrove for his invaluable comments.

7. References

[1] A. MAEDCHE, S. STAAB, N. STOJANOVIC, R. STUDER, and Y. SURE, 2001, “SEAL – a framework for developing semantic Web portals.”, Proc. of the 18th British National Conference on Databases (BNCOD), July 2001, Chilton, U.K.
[2] C. JENKINS, M. JACKSON, P. BURDEN, and J. WALLIS, 1999, “Automatic RDF metadata generation for resource discovery.”, Proc. of 8th International WWW Conference, May 11-14, 1999, Toronto.
[3] D. BRICKLEY, and R. GUHA, 1999, “Resource description framework (RDF) schema specification - W3C Recommendation 23 March 2000.”, U.S.: W3C Consortium <http://www.w3.org/TR/2000/CR-rdf-schema-20000327/> (updated 27 March 2000, accessed 26 February 2001).
[4] DBXML, 2001, “The dbXML Project.”, <http://www.dbxml.org/> (accessed 7 January 2002).
[5] D. FENSEL, 2001, Ontologies: a silver bullet for knowledge management and electronic commerce. 1st ed., Heidelberg: Springer.
[6] E.D. LIDDY, W. PAIK, and E.S. YU, 1994, “Text categorization for multiple users based on semantic features from a machine-readable dictionary.”, ACM Transactions on Information Systems, 12(3), pp. 278-295.
[7] GOOGLE, 2001, “Google Web Directory.”, <http://www.google.com/dirhp?hl=en> (accessed September 2001).
[8] I. HORROCKS, D. FENSEL, J. BROEKSTRA, S. DECKER, M. ERDMANN, C. GOBLE, F. van HARMELEN, M. KLEIN, S. STAAB, R. STUDER, and E. MOTTA, 2000, The ontology inference layer, OIL. Information Society Technologies. <http://www.ontoknowledge.org/oil/TR/oil.long.html> (accessed 14 December 2000).
[9] J.L. FAGAN, 1987, “Automatic phrase indexing for document retrieval - an examination of syntactic and non-syntactic methods.”, Proc. of the 10th annual international ACM SIGIR, June 3-5, 1987, pp. 91-101, New Orleans, LA, U.S.
[10] K. AHMAD, R. BONTHRONE, G. ENGEL, A. FOTOPOULOU, D. FRY, C. GALINSKI, J. HUMBLEY, N. KALFON, M. ROGERS, C. ROULIN, K. SCHMALENBACH, and E. TANKE, 1995, “The Importance of terminology.”, Final Report of the POINTER Project.
[11] LIBRARY OF CONGRESS, 2002, “Library of Congress Bibliographies.”, <http://lcweb.loc.gov/rr/bibguide.html> (updated on 25 May 2001, accessed 13 September 2001).
[12] M. CRAVEN, D. DIPASQUO, D. FREITAG, A. MCCALLUM, T. MITCHELL, K. NIGAM, and S. SLATTERY, 2000, “Learning to construct knowledge bases from the World Wide Web.”, Artificial Intelligence, 118(1-2), pp. 69-113.
[13] M. GOLDSTEIN, and W. R. DILLON, 1978, Discrete discriminant analysis, New York: John Wiley & Sons, Inc.
[14] M. JACKSON, and P. BURDEN, 1999, “WWLib-TNG - new directions in search engine technology.”, IEE Informatics Colloquium Lost in the Web - navigation on the Internet, pp. 10/1-10/8, November 1999.
[15] M.J. BLOSSEVILLE, G. HEBRAIL, M.G. MONTEIL, and M. PENOT, 1992, “Automatic document classification: natural language processing, statistical analysis, and expert system techniques used together.”, Proc. of the 15th annual international ACM SIGIR, June 1992, Copenhagen, Denmark.
[16] M.W. BERRY, and M. BROWNE, 1999, Understanding search engines – mathematical modelling and text retrieval. 1st ed., Philadelphia: SIAM (Society for Industrial and Applied Mathematics).
[17] N.F. NOY, and D.L. MCGUINNESS, 2001, Ontology development 101: a guide to creating your first ontology. Knowledge Systems Laboratory (KSL) of Department of Computer Science Stanford, USA: Technical report, KSL-01-05.
[18] N. GOEVERT, M. LALMAS, and N. FUHR, 1999, “A probabilistic description-oriented approach for categorising Web documents.”, Proc. of the 8th ACM International Conference on Information and Knowledge Management, November 2-4, 1999, pp. 475-482, Kansas City, U.S.
[19] O. LASSILA, and R.R. SWICK, 1999, “Resource description framework (RDF) model and syntax specification - W3C Recommendation 22 February 1999.”, World Wide Web (W3C) Consortium <http://www.w3.org/TR/REC-rdf-syntax/> (updated 24 February 2000, accessed 26 February 2001).
[20] ONLINE COMPUTER LIBRARY CENTER, Inc., 2002, “Dewey Decimal Classification.”, <http://www.oclc.org/dewey/about/ddc_21_summaries.htm> (accessed 13 September 2001).
[21] S. DECKER, F. van HARMELEN, J. BROEKSTRA, M. ERDMANN, D. FENSEL, I. HORROCKS, M. KLEIN, and S. MELNIK, 2000, “The Semantic Web – on the respective roles of XML and RDF.”, IEEE Internet Computing, 4(5), pp. 63-74.
[22] S. RUSSELL, and P. NORVIG, 1995, Artificial intelligence - a modern approach. 1st ed., New Jersey: Prentice-Hall.
[23] STANFORD UNIVERSITY, Database Group, 2001, “RDF API Draft.”, <http://www-db.stanford.edu/~melnik/rdf/api.html#overview> (accessed 15 January 2001).
[24] S.T. DUMAIS, and H. CHEN, 2000, “Hierarchical classification of Web content.”, Proc. of the 23rd Annual International ACM SIGIR, July 24-28, 2000, Athens, Greece.
[25] T. KOCH, A. BRUEMMER, D. HIOM, M. PEEREBOORN, A. POULTER, and E. WORSFOLD, 1998, “Specification for resource description methods Part 3: the role of classification schemes in Internet resource description and discovery.”, European Commission: a DESIRE project.
[26] YAHOO! Inc., 2002, “Yahoo! search engine.”, <http://www.yahoo.com> (accessed 7 January 2002).
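
The weight normalisation of Stage 3 and the activation values of Examples 1 to 3 in Stage 4 can be checked with a short Python sketch. It assumes that the Sigmoid function takes its standard logistic form and that every conceptual instance found on the page has the same raw weight. It hard-codes the link weights quoted in the examples (1.7976, 3.5951 and 2.2976). The function names and the dictionary layout of the network in Figure 1 are illustrative assumptions and are not taken from ACE; the threshold t = 0.5 is not modelled.

```python
import math


def sigmoid(x):
    """Standard logistic function, assumed to be the Sigmoid used as f."""
    return 1.0 / (1.0 + math.exp(-x))


def normalise(weights):
    """Scale weights so that their Euclidean norm is 1 (Stage 3)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


# Link weights as quoted in Examples 1 to 3 (network of Figure 1).
# Conceptual instance -> (shared class, weight of the link).
W_CI_SC = {"CI1": ("SC1", 1.7976), "CI2": ("SC1", 1.7976), "CI3": ("SC2", 3.5951)}
# Shared class -> weight of the link to the class representative CR1.
W_SC_CR = {"SC1": 2.2976, "SC2": 2.2976}


def activation_cr1(raw_weights):
    """Activation of CR1 for the conceptual instances found on a page.

    raw_weights maps a conceptual instance name to its raw weight
    (hypothetical input format). Each input unit has activation 1 and
    its normalised weight acts as the environment contribution E.
    """
    names = list(raw_weights)
    env = dict(zip(names, normalise([raw_weights[n] for n in names])))
    signal = {}
    for name, e in env.items():
        shared_class, w = W_CI_SC[name]
        signal[shared_class] = signal.get(shared_class, 0.0) + (w * 1.0 + e)
    total = sum(W_SC_CR[sc] * sigmoid(x) for sc, x in signal.items())
    return sigmoid(total)


if __name__ == "__main__":
    for page in ({"CI1": 1.0},
                 {"CI1": 1.0, "CI3": 1.0},
                 {"CI1": 1.0, "CI2": 1.0, "CI3": 1.0}):
        print(sorted(page), f"{activation_cr1(page):.4f}")
```

Under these assumptions the three calls give values that agree with the quoted 0.8971, 0.9878 and 0.9894 to about three decimal places, which supports reading f as the logistic function.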

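
The coverage and accuracy ratios defined in section 5 and the chi-square statistics reported in section 5.3 can be recomputed from the observed frequencies in Tables 2 and 3. The sketch below is a plain Pearson chi-square test of independence written for illustration; the function names are assumptions, and the critical values are the ones quoted in section 5.3 rather than computed.

```python
def coverage_and_accuracy(correct, wrong, unclassified):
    """Coverage = classified / total; accuracy = correct / classified."""
    classified = correct + wrong
    total = classified + unclassified
    return classified / total, correct / classified


def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (o - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


# Columns: correctly classified, wrongly classified, not classified.
TABLE_2 = [[166, 9, 27], [173, 11, 16]]   # DDC 300 vs DDC 500
TABLE_3 = [[166, 9, 27], [137, 9, 61]]    # pre-classified vs random pages
CRITICAL = {0.05: 5.99, 0.01: 9.21}       # quoted critical values, 2 d.o.f.

if __name__ == "__main__":
    for name, table, alpha in (("Table 2", TABLE_2, 0.05),
                               ("Table 3", TABLE_3, 0.01)):
        stat, dof = chi_square(table)
        verdict = "reject H0" if stat > CRITICAL[alpha] else "do not reject H0"
        print(f"{name}: chi2 = {stat:.4f}, dof = {dof}, {verdict}")
    cov, acc = coverage_and_accuracy(*TABLE_2[0])
    print(f"DDC 300 test set: coverage = {cov:.2%}, accuracy = {acc:.2%}")
```

The coverage and accuracy figures printed for the DDC 300 test set are derived here from the Table 2 counts and are not figures stated in the text.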