
International Journal of Advanced Computer Science, Vol. 3, No. 3, Pp. 130-138, Mar., 2013.

Manuscript Received: 4 Oct. 2011; Revised: 1 Nov. 2011; Accepted: 17 Jan. 2013; Published: 15 Feb. 2013

Keywords: Information Retrieval, NLP, LDA, Query Enrichment, Ontologies, Semantic Web, Reinforcement Learning.
Abstract: Search engines and information retrieval (IR) systems provide a mechanism for users to access the large amounts of information available through the Internet. However, to find the desired information, the user has to sift through a staggering amount of information retrieved from highly dynamic resources. Our approach involves enriching the user's query with both linguistically and statistically related semantic concept terms. We employ natural language processing (NLP) techniques, such as the WordNet engine, to enrich the user's query with semantically equivalent lexical terms, and a probabilistic topic model, latent Dirichlet allocation (LDA), to extract highly ranked topics from the information retrieved for a query. Furthermore, an intelligent learning algorithm, reinforcement learning (RL), is integrated into the design to assist end users in selecting the concept domains that are most relevant to their needs. Experimental results show that the proposed approach to constructing specialized domains improves the precision of information retrieval.
1. Introduction
Information seekers go through large amounts of information on the Internet before they can find what they are looking for. This process is becoming more difficult with the ever-growing amount of information available on the Internet. Domain-specific information retrieval systems can alleviate this difficulty. These systems construct models that define the domain of knowledge the users are actually looking for, based on their search queries. With the recent demand for and development of knowledge-based applications such as intelligent question answering, the Semantic Web, and semantic-level multimedia indexing and retrieval [2], interest in specialized domain knowledge bases has increased [7, 13]. Human-created domain ontologies present strong semantic features, but growing large-scale ontologies requires both time and consistency [20]. Classification and clustering are established methods for constructing domain ontologies automatically. The hierarchical agglomerative clustering (HAC) algorithm is one of these methods [4]. Such a tree-hierarchy approach is both consistent and scalable; however, it is term-based and
1. A. Mooman is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, CA (Email: amooman@uwaterloo.ca).
2. O. Basir is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, CA (Email: obasir@uwaterloo.ca).
3. A. Younes is with the Department of Systems Design Engineering, University of Waterloo, Waterloo, CA (Email: ayounes@uwaterloo.ca).
not semantic-aware. In contrast, a specialized domain collects terms that are semantically (conceptually) related to a relevant domain. This paper presents an intelligent model that constructs specialized domain ontologies to enhance the mapping of retrieved information to the interests of end users.
A Specialized Domain (SD) refers to a specific knowledge environment about a specialized interest, problem, or discipline. It consists of two subsets of knowledge:
- One subset contains terms that provide ontologies to represent the specialty of the domain. These ontologies contain concept terms that describe the information objects in the domain; i.e., the terms provide a semantic description of the domain, which is applied extensively for processing queries.
- The second subset contains text documents that are related to the domain terms. The specialized domain is intended to be a description of the application domain from the point of view of end users.
For example, the domain provides an ontology to describe the application domain, and the documents within the domain are representative of the knowledge domain. The presented approach aims to improve the precision of information retrieval by mapping users' queries to relevant information domains. The approach combines two types of resources to automatically construct specialized domains of semantic concepts: lexical knowledge objects (dictionary-based) and statistical knowledge objects (real data, such as information from the Internet). The WordNet engine is adopted to enrich the user's query with relevant ontologies, and the LDA probabilistic topic model is used to extract high-level semantic topics from the retrieved information. The cosine similarity function (TF-IDF) is then applied to extend the WordNet ontologies into the LDA-extracted topic sets. Mapping is performed by clustering around the highly ranked ontology terms to form the newly discovered domain sets. To ensure that the discovered domain sets are relevant to what the end users are searching for, we integrate a reinforcement learning (RL) algorithm with the specialized domain model. The RL algorithm [26, 27] is applied to learn the users' behaviour through feedback within an interactive environment.
The rest of this paper is organized as follows. In Section 2, we present the system architecture design of the proposed approach. In Section 3, we describe how to enrich the user's query with semantic lexical terms using WordNet. The probabilistic topic model (LDA) adopted for extracting highly ranked topics from the retrieved data is presented in Sections 4 and 5. In Section 6, we present a method to map the LDA topic concepts and the semantically enriched WordNet ontologies to form the new specialized domain ontologies. The reinforcement learning model is presented in Section 7. Experimental results and a case study of the proposed method, using real Internet data and the TREC IR evaluation techniques [29-31], are presented in Section 8. In the conclusions, we give ideas for future research.

This paper is an extension of a conference paper that was published in ICCSIT 2011 [35].

Intelligent Learning Agents to Construct Specialized Domain Ontologies
A. Mooman (1), O. Basir (2), & A. Younes (3)
International Journal Publishers Group (IJPG)

Fig. 1. Specialized Domain Construction Architecture Design.
2. Architecture Design Overview
The task of discovering and building a new domain consists of two parallel processes: topic extraction through the search engine and query refinement through the WordNet engine. As Fig. 1 illustrates, the process of achieving query refinement takes seven steps. First, query-related text documents are retrieved from the Internet. Second, and at the same time, the query is enriched by infusing its terms with sense terms extracted from the WordNet database. Third, the retrieved documents are filtered of their Web syntax and converted into a text format. Fourth, once the Web documents are filtered, the text documents are normalized and stop-words are removed. Fifth, using a topic extraction algorithm, semantic topics are extracted from the text documents and clustered into relevant groups through agglomerative clustering. Sixth, once the query is infused with WordNet ontologies, the similarity function process assesses similarities between the enriched query hypernym sets and each clustered topic group, and adds the query hypernyms to a topic group if no similarity is found. Seventh and finally, the extracted and enriched topic sets are mapped to the RL learning environment in the form of a matrix grid, in which each discovered topic set is represented by a cell. These steps are discussed in detail in the coming sections.
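As a rough orchestration sketch, the seven steps can be wired together as below; every function name here is a hypothetical placeholder injected by the caller, not the authors' implementation:

```python
def build_domain(query, retrieve, enrich, html_to_text, normalize,
                 extract_topics, merge_by_similarity, to_grid):
    """Sketch of the seven-step pipeline; each callable is a
    hypothetical stand-in for the corresponding component."""
    docs = retrieve(query)                      # 1. retrieve query-related documents
    enriched = enrich(query)                    # 2. infuse the query with WordNet senses
    texts = [html_to_text(d) for d in docs]     # 3. strip Web syntax to plain text
    texts = [normalize(t) for t in texts]       # 4. normalize, remove stop-words
    topic_groups = extract_topics(texts)        # 5. LDA topics + agglomerative clustering
    merged = merge_by_similarity(enriched, topic_groups)  # 6. similarity-based merge
    return to_grid(merged)                      # 7. map topic sets onto the RL matrix grid
```

Each stage is examined individually in the sections that follow.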
3. Query Enrichment via WordNet Ontology
By leveraging the WordNet lexical database [14], the end user's query is enriched with WordNet lexical terms. The WordNet ontology, as defined in [18, 20], is a large lexical English database whose structure makes it a useful tool for computational linguistics, data mining, information retrieval, and NLP [9]. The objective of extracting lexical synonyms from WordNet is to enrich the overall domain ontologies with synset terms in addition to the concepts extracted through the Internet [3, 6, 20]. Querying for the lexical enrichment [20] of ontologies consists of a number of steps, each of which is addressed in the remainder of this section. First, the query is normalized: it is tokenized, all stop-words are removed, and stemming is applied, yielding q = {t_1, ..., t_n}, where q is the initial query and the t are the query terms. Second, term synsets are extracted for all the query terms, and the synsets are searched for in WordNet. Third, the hypernyms of each term are obtained through WordNet. Finally, the most relevant synset is selected, and the hypernyms of this synset are added according to the corresponding term representation. Using WordNet, the superordinate concept (hypernym) senses of each term are agglomerated into a set of ontology terms. The synonyms of a term can be enriched further by checking the WordNet hyponymy and/or hypernymy hierarchical structure, in which the term gains a more specific concept with a hyponym and a more generic concept with a hypernym [14]. The new term word(s) are assigned for later semantic matching with the extracted topics. A given search query q is normalized (filtered and stemmed) before it is passed to WordNet to extract semantic lexical hypernym senses for the query terms t:
q = (t_1, t_2, ..., t_n),

where q is the query consisting of the normalized terms t. Each term is then expanded:

q(t_1) = {t_11, ..., t_1n}, q(t_2) = {t_21, ..., t_2n}, ..., q(t_n) = {t_n1, ..., t_nn},

where q(t_n) is a query term and the t_nn are the lexical synonym terms extracted from the WordNet sense hypernyms.
For example, if the user's search query is Q = {diabetes diets}, the query is enriched with WordNet ontologies by first being normalized, as shown in Table 1, into Q(t_1) = {diabetes} and Q(t_2) = {diets}, and then passed to the WordNet engine to extract the ontology concepts.
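The enrichment steps can be sketched as follows; the toy lexicon here merely stands in for the WordNet database, and the crude plural stemmer and stop-word list are illustrative assumptions:

```python
# Toy stand-in for WordNet: maps a stemmed term to (synonyms, hypernyms).
# A real system would query the WordNet database instead.
TOY_LEXICON = {
    "diabetes": (["diabetes mellitus"], ["metabolic disorder", "disease"]),
    "diet": (["dietary regime"], ["nutrition", "intake"]),
}

STOPWORDS = {"a", "an", "the", "of", "for"}

def normalize(query):
    """Tokenize, drop stop-words, and apply a crude plural stemmer."""
    tokens = [t.lower().strip(".,") for t in query.split()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") and t[:-1] in TOY_LEXICON else t
            for t in tokens]

def enrich(query, lexicon=TOY_LEXICON):
    """Return q(t_i) -> {t_i1, ..., t_in}: each term mapped to the
    synonym and hypernym senses found in the lexicon."""
    return {t: list(lexicon.get(t, ([], []))[0]) +
               list(lexicon.get(t, ([], []))[1])
            for t in normalize(query)}
```

For the query "diabetes diets", `enrich` returns each normalized term together with its synonym and hypernym senses, mirroring the Table 1 example.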
In spite of the vast lexical database of WordNet and other dictionary-based databases, it is possible that important information might not be found within WordNet or similar dictionary-based sources (linguistic and semantic knowledge resources).
TABLE 1
AN EXAMPLE OF ENRICHING A USER'S QUERY WITH WORDNET ONTOLOGIES.
Therefore, domain knowledge ontologies should be comprehensively relevant to various sorts of concepts. Real data, such as the Internet that users use to seek information, can be utilized to extract such concepts.
4. Topic Extraction Using a Search Engine
At the same time as the query is enriched with related concepts using WordNet, the same query is used to extract relevant topics from a set of documents retrieved from the Internet (or an information repository). The newly discovered specialized domain comprises not only the computational linguistic knowledge objects (WordNet ontologies) but also related information extracted from a global and dynamically evolving information source such as the Internet. The limitation of WordNet is that it does not carry every word or concept within its database; as a result, it is desirable to extend the extracted WordNet senses with concepts from dynamic and real data. This paper presents an unsupervised approach that automatically constructs a specialized domain for knowledge-based queries by leveraging search engine applications to retrieve a set of documents related to the query. Retrieved documents are pre-processed using NLP steps, including text normalization and stop-word elimination, and filtered into a text format using tools such as HTML2TEXT. Through latent Dirichlet allocation (LDA) [1], the retrieved and normalized documents are induced to discover the topics of each document.
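The pre-processing stage can be sketched with the standard library alone; this HTML-to-text filter is a minimal stand-in for a tool such as HTML2TEXT, and the stop-word list is an illustrative assumption:

```python
from html.parser import HTMLParser
import re

class TextExtractor(HTMLParser):
    """Minimal HTML2TEXT stand-in: keeps only text nodes,
    skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess(html):
    """Filter Web syntax to text, lowercase, and remove stop-words."""
    p = TextExtractor()
    p.feed(html)
    tokens = re.findall(r"[a-z]+", " ".join(p.parts).lower())
    return [t for t in tokens if t not in STOPWORDS]
```

The resulting token lists form the corpus that is handed to the topic model.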
5. Semantic Topic Extraction
Extracting a topic that represents a document or a set of documents is one of the current challenges in NLP and IR research [3, 5], which is why latent topic extraction has emerged as a popular technique for identifying topics from text documents based on semantic concepts rather than on a bag of words. LDA [1] is a probabilistic topic model originally applied in natural language processing, but it has since been applied to extract topics in various applications [15, 16, 21], and it is the semantic analysis algorithm best suited to the domain construction this paper proposes. Semantic analysis algorithms that, like LDA, focus on topic detection in text data include Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (pLSA) [1, 21]. These algorithms have recently been an area of considerable interest in Machine Learning [1, 8, 21]. The following process generates documents in the pLSI model:
- Pick a topic mixture distribution P(·|d) for each document d.
- Pick a latent topic z with probability P(z|d) for each word token.
- Generate the word token w with probability P(w|z).

The probability of generating a document d, as a bag of words w_1 ... w_{N_d} (where N_d is the number of words of document d), is:

P(w_1, ..., w_{N_d} | d) = ∏_{i=1}^{N_d} ∑_{z=1}^{K} P(w_i | z) P(z | d)
Some of the features of LDA that are superior to those of pLSA and cluster models can be listed as follows:
1) Compared to the pLSI model, LDA resolves the overfitting problem and the problem of generating new documents by treating the topic mixture distribution as a set of random hidden parameters instead of a large set of individual parameters [1].
2) Compared to the cluster model, the LDA model allows a document to exhibit multiple topics to different degrees, which makes LDA more flexible than the cluster model's assumption that each document is generated from only one topic [1, 21].
For these reasons, the LDA algorithm, a statistical model, has emerged as a popular topic algorithm that has been applied to text document classification and to identifying topics from text documents [1, 4, 15, 21]. A thorough and complete description of the LDA model can be found in [1].
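A compact collapsed Gibbs sampler conveys how LDA inference proceeds; this is a pedagogical sketch, not the GibbsLDA implementation used in the experiments, and the hyperparameter defaults simply mirror the values reported in the experimental setup (alpha = 0.5, beta = 0.1):

```python
import random

def lda_gibbs(docs, K, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of token lists.
    Returns (assignments z, topic-word count table, vocabulary)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    # Count tables: document-topic, topic-word, and topic totals.
    n_dk = [[0] * K for _ in docs]
    n_kw = [[0] * V for _ in range(K)]
    n_k = [0] * K
    z = []                                # topic assignment per token
    for d, doc in enumerate(docs):        # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            n_dk[d][k] += 1; n_kw[k][widx[w]] += 1; n_k[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]               # remove the token's current topic
                n_dk[d][k] -= 1; n_kw[k][widx[w]] -= 1; n_k[k] -= 1
                # Full conditional: p(k) ∝ (n_dk+α)(n_kw+β)/(n_k+Vβ)
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][widx[w]] + beta)
                           / (n_k[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(weights)
                k = K - 1
                for j, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = j
                        break
                z[d][i] = k               # resample and restore counts
                n_dk[d][k] += 1; n_kw[k][widx[w]] += 1; n_k[k] += 1
    return z, n_kw, vocab

def top_terms(n_kw, vocab, n=7):
    """Most likely terms per topic (the lists saved for each model)."""
    return [[vocab[i] for i, _ in
             sorted(enumerate(row), key=lambda p: -p[1])[:n]]
            for row in n_kw]
```

The topic-word counts, once normalized, give each topic's term distribution; `top_terms` yields the per-topic term lists used later to label the domain.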
A. Applying LDA to Extract Topics
Since LDA has been shown to be successful at extracting semantic topics from text documents, we have extended LDA to induce the construction of domain concepts (topics) in specialized knowledge-base domains. This section constructs automatic domain concepts for specialized domain knowledge bases. The following steps describe the proposed technique for extracting domain concepts utilizing LDA:
1) Text document generation. Retrieved documents from the Internet are normalized by converting them into a text format. Each document is pre-processed using NLP steps, filtered, and represented as a text document in a corpus to be used as the input of LDA.
2) Latent topic extraction with LDA. A set of semantic latent topics is produced by extracting topics from each text document. Each document is associated with a topic vector which specifies the topic distribution of the document.
3) Clustering relevant topic groups. Once the topics are extracted for each text document, they are clustered into groups by comparing them to each other using a similarity function. This step is added to rank the extracted similar topics into a higher level of topics in a hierarchical manner; cosine similarity with hierarchical agglomerative clustering (HAC) [4, 5] between each pair of topics is applied. If the similarity between two topics is greater than the threshold, the system clusters them into the same group. Note that a topic may belong to several different categories. The idea behind clustering around the topics is to rank the groups that have the most related concepts [5]. This ranking will be evaluated and ranked by the end users.
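Step 3 can be sketched as follows; the greedy single-pass grouping below is a simplified stand-in for full hierarchical agglomerative clustering (it assigns each topic to one group only), and the threshold value is illustrative:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_topics(topics, threshold=0.5):
    """Greedy single-link grouping: merge a topic into the first group
    whose representative is at least `threshold` similar."""
    groups = []
    for t in topics:
        vec = Counter(t)
        for g in groups:
            if cosine(vec, g[0]) >= threshold:  # g[0] is the representative
                g.append(vec)
                break
        else:
            groups.append([vec])
    return groups
```

Topics sharing most of their terms end up in the same group, while unrelated topics seed new groups.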
For example, Table 2 depicts some of the topic lists extracted from Web pages, using the top 10 retrieved pages from each of the Yahoo.com, BING.com, Google.com, Clusty.com, and National Library of Medicine (NLM) search engines. The query (q) "diabetes diet" was used.
TABLE 2
AN EXAMPLE OF EXTRACTING TOPICS FROM THE INTERNET USING LDA.
6. Mapping Concepts into Specialized Domains
Mapping the semantically extracted ontology of each group into a representation of a specialized domain requires further enrichment with information based on linguistic terms relevant to the original query. The idea of combining terms based on raw data retrieved from the search engine with linguistic synonyms is to fill the gap left by a domain built on only one of the knowledge bases, which would miss valuable information; for example, some technical and scientific terms are not yet listed within the WordNet database. Using the cosine similarity function to find the similarity between the WordNet-enhanced synonym terms and the topic concepts extracted through the search engine, a term that matches the similarity threshold (i.e., 0.8) is kept, or is added into the topic group if no match is found. In synonym discovery, the cosine similarity function delivers not only spelling variants but also terms that belong to the same WordNet synset [28]. The enhanced WordNet query hypernyms are added into each semantically extracted topic (ontology class) group if no similarity is found. The domain topic group consists of the following information:
1) Domain identification (ID): each domain has a unique ID.
2) A list of extracted semantic and linguistic ontology terms.

Right Reservea to the listea search Engines ana


http.//vsearch.nlm.nih.gov.
3) Domain relevancy weight: based on the frequency (weight) of related terms within each domain.
4) Usability: the total number of times the domain has been selected is registered; it also indicates the number of times the domain has been dynamically updated, since a domain is dynamically updated every time it is used.
A constructed domain thus consists of the following:

D_e = {[ID], [t_1, t_2, ..., t_n], [selected, weight_f]}.
For example, Table 3 shows the constructed domain of mixed WordNet and extracted-topic terms (i.e., the metadata of a specific domain) for the user's search query Q = {diabetes diets}.

TABLE 3
AN EXAMPLE OF COMBINING WORDNET ONTOLOGIES WITH THE EXTRACTED TOPICS.
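The threshold-based merge can be sketched as follows; representing single terms by character bigrams (so that spelling variants score highly) is an illustrative assumption, not the paper's exact TF-IDF formulation:

```python
import math
from collections import Counter

def bigram_cos(a, b):
    """Cosine similarity over character bigrams, which catches
    spelling variants of the same term."""
    ca, cb = Counter(zip(a, a[1:])), Counter(zip(b, b[1:]))
    dot = sum(ca[g] * cb[g] for g in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_hypernyms(topic_group, hypernyms, threshold=0.8):
    """Add each enriched-query hypernym to the topic group only when no
    existing term already matches it at the similarity threshold."""
    merged = list(topic_group)
    for h in hypernyms:
        if not any(bigram_cos(h, t) >= threshold for t in merged):
            merged.append(h)
    return merged
```

A hypernym already covered by the topic group is dropped as redundant; a genuinely new one is appended, growing the domain's term list.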
7. Reinforcement Learning
RL appeals to many researchers because of its generality [25-27]. RL is the algorithmic model of human learning by trial and error. Basically, an agent interacts with an environment by trying various actions to identify and select those that yield the best average reward. Unlike a supervised learning agent, which has an instructor providing the optimal action for each state, the RL agent must discover the optimal actions on its own by trying all the actions and selecting the best one [27].
A learning environment task is defined by a set of states, s ∈ S, a set of actions, a ∈ A, a state-action transition function, T: S × A → S, and a reward function, R: S × A → ℝ. At each time step, the learning agent selects an action, and as a result is given a reward and a new state. The goal of reinforcement learning is to learn a policy. A policy, π, is a description of the learning agent's behaviour and maps the perceived states, S, of the environment to the actions, A: π: S → A. We use an on-policy algorithm, SARSA (State-Action-Reward-State-Action), where rewards are observed at each state.
SARSA differs from Q-learning in that the Q-values are not updated according to the maximum reward of the next states [22, 27]. The SARSA algorithm uses the new action, and therefore the reward, selected by the same policy that determined the original action. Updates are accomplished using the original state and action, the observed reward r, and the next state-action pair. SARSA updates not just on the trajectory (s_t, r_t, s_{t+1}), but rather on the trajectory (s_t, a_t, r_t, s_{t+1}, a_{t+1}). More specifically, given the state-action pair (s_t, a_t), SARSA simulates the action a_t in state s_t to obtain the reward r_t and the transition to state s_{t+1}. The algorithm then uses its current optimal policy, based on the current Q-values, to generate its next action a_{t+1} (but with some probability it chooses an action at random) [27]. At this point, SARSA updates Q(s_t, a_t) as follows:

Q(s_t, a_t) = (1 − α) Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1})]   (Equ. 1)
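Equ. 1 translates directly into a one-step update; the learning-rate and discount defaults below are illustrative:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step (Equ. 1):
    Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*Q(s',a'))."""
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + \
                alpha * (r + gamma * Q.get((s_next, a_next), 0.0))
    return Q[(s, a)]
```

Because the successor term is Q(s', a') for the action the policy actually takes, rather than a max over actions, the update remains on-policy.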
A. Learning Model
As described in the previous section, RL is expressed formally in terms of reward signals arising from the agent's interaction with its environment. This interaction takes the form of the agent sensing the environment and accordingly choosing an action to perform in it. The chosen action changes the environment, and this change is communicated to the agent through a scalar reinforcement signal. The use of a reward (feedback) signal to formalize the idea of a goal is one of the most distinctive features of RL [24, 27]. The specialized domain construction model consists of elements that match the RL model elements:
- A dynamic learning environment, represented as a matrix grid consisting of a set of ontology topics to be checked by the agent.
- A state S, which is the current state of the agent at one of the grid cells, represented by one of the ontology topic sets.
- A reward R, which represents the user's feedback on the information presented.
- An action A, in which the agent moves within the matrix grid; the agent can move up, down, left, or right.
- A goal G, which consists of the topic set that has the highest weight (frequency) among the topic sets.
The RL learning environment acts as a guide for the user to select the most relevant topic set that represents the domain the user is looking for.
B. Learning Environment
In response to the user's query, the constructed ontology sets are presented to the end user in the form of a learning environment matrix grid. Each cell within the matrix grid contains a reference to an ontology set that is constructed through WordNet and LDA. We denote the extracted domain ontology sets as the constructed domain set D of the end user's query q:

D_f(q_i) = {o_1, o_2, ..., o_k}

where D_f(q_i) is the set of constructed ontology sets per user query for each learning process, and o_k is one of the constructed ontology sets for query q_i. Each ontology set combines relevant concepts from the WordNet hypernyms and the highly ranked LDA semantic topics: o_k = {T_1, T_2, ..., T_l}.
Suppose that the constructed domain ontology set consists of the top six sets per user query: D(q_1) = {o_1, o_2, o_3, o_4, o_5, o_6}. Each ontology set consists of combined terms extracted from the WordNet database and highly ranked topics extracted from the Internet documents through LDA: o_k = {T_1, T_2, ..., T_l}, where T_l is an ontology term. The learning environment is represented as a matrix grid (a grid-world problem), where the ontology sets are populated within the grid cells. Each cell is a reference to one ontology set. For the purpose of explaining the example, each cell is represented by a letter: D_1 = {A, B, C, D, E, F}, where each letter represents relevant information, i.e., o_1 → A, o_2 → B, o_3 → C, o_4 → D, o_5 → E, and o_6 → F. Assume that the ultimate goal, the ontology set that carries the highest weight (frequency), is o_4 → D. Initially, the agent can be in any state (cell) and can move from one state to another in four directions (up, down, left, or right) to find the goal. In this simulation, the goal resides in the (o_4) cell, which has an instant reward of 100.
Other states that do not have direct connections to the target cell have zero rewards. The matrix grid is depicted in Table 4.

TABLE 4
MAPPING ONTOLOGY SETS INTO THE RL LEARNING GRID.
The task of the learning agent is to follow the states that will lead to the ultimate goal. A state-graph representation of the ontology sets is depicted in Fig. 2, in which nodes represent ontology sets, arcs represent the agent's next possible states, and edge weights represent rewards. For example, to reach the goal (D) from A, the agent would try one of the following possible learning paths:

A → B → D,
A → C → D, or
A → C → E → F → D.
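A runnable sketch of this learning task follows: the state graph encodes the paths listed above (cells A-F, goal D with an instant reward of 100), trained with epsilon-greedy SARSA. The adjacency, episode count, and parameter values are illustrative assumptions, not the paper's exact Table 4 grid:

```python
import random

# State graph from Fig. 2: the six cells A-F hold the ontology sets,
# D (= o4) is the goal with an instant reward of 100. The adjacency is
# read off the example learning paths above (an assumed layout).
EDGES = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D", "E"],
         "E": ["C", "F"], "F": ["E", "D"]}
GOAL = "D"

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular SARSA with an epsilon-greedy policy (Equ. 1 update)."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s, acts in EDGES.items() for a in acts}

    def pick(s):
        if rng.random() < eps:                          # explore
            return rng.choice(EDGES[s])
        return max(EDGES[s], key=lambda a: Q[(s, a)])   # exploit

    for _ in range(episodes):
        s = rng.choice(list(EDGES))        # start in any non-goal cell
        a = pick(s)
        while s != GOAL:
            s2 = a                          # taking action a moves to cell a
            r = 100.0 if s2 == GOAL else 0.0
            if s2 == GOAL:
                target = r                  # terminal: no successor Q term
            else:
                a2 = pick(s2)
                target = r + gamma * Q[(s2, a2)]
            # Equ. 1: Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma Q(s',a')]
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
            if s != GOAL:
                a = a2
    return Q

def greedy_path(Q, start="A", limit=6):
    """Follow the learned policy from `start` toward the goal."""
    path = [start]
    while path[-1] != GOAL and len(path) <= limit:
        s = path[-1]
        path.append(max(EDGES[s], key=lambda a: Q[(s, a)]))
    return path
```

After training, the greedy policy from A follows one of the short paths to D, mirroring the learning paths listed above.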
Fig. 2. State diagram graph of the learning grid.
The RL agent's task is to assist the end user in selecting the ontology set with the highest weight (the concept most relevant to his/her query). Nevertheless, the ultimate decision to select the ontology set relevant to his/her needs is made by the end user. The selected set is augmented into the specialized domain ontology repository, with its reward scoring based on the user's feedback.
8. Experimentation and Evaluation Results
To obtain a comparable evaluation and to examine the effectiveness of our specialized domain construction model (SDCM), experiments and case studies were conducted. Details are presented in the following sections.
A. Building Semantic Ontology
In this experiment, we compare the quality of ontologies generated using our approach with that of other approaches, such as Kea key-phrase extraction [19], the CLUTO clustering toolkit [17], and the N-gram technique [12]. We exploited real search engines in experiments pertaining to actual end users searching for real-time, real data, without any prior knowledge or training data sets. For evaluating the quality of ontology construction, we employed the above search engines to retrieve 50 web documents for "diabetes diets" (the top 10 documents from each search engine); the webpages were converted into a text format.
In our approach, GibbsLDA [16] is adopted because of its speed and because it is designed to analyze the hidden and latent topic structures of various datasets, including text and Web documents. Using GibbsLDA, the experiment runs are set for 10 topics with α = 0.5 and β = 0.1. We performed 200 Gibbs sampling iterations and saved a model at every 25 iterations. The list of the 7 most likely terms for each topic is saved for each model. Table 5 shows the results of experiment [1], extracting relevant terms using the 5 mechanisms, including our approach. It demonstrates that the terms are rich and comprehensive with respect to the user's query. Acting as metadata to enrich the user's query and filter the information retrieved for the query, only documents that contain the domain ontologies will be mapped to the end users.
B. Experiment [2]: Enhancing the Precision and Recall of Existing Search Engines through Domain Specialization
In this experiment, we measured the precision and recall (standard IR measures) of information retrieved using the typical search engines mentioned above as well as our intelligent model. We embedded the LEMUR Toolkit [32], an IR application toolkit, with our model as an intermediary to index and retrieve relevant information and map it to the end users. We adopted the specialized domain ontology [diabetic diet] that was constructed in experiment [1] for building the domain knowledge base and enriching users' queries. In this experiment, we retrieved the top 35 pages from each search engine, including our approach, and removed the duplicate and irrelevant ones from each result; 210 pages (35 pages per search engine) were used for the precision and recall experiment. With respect to users and web pages, after the first 15-25 web pages, the degree of relevance becomes very low. We compared our approach with the publicly accessible search engines Google, Yahoo, Bing, Clusty, and NLM. As shown in Fig. 3, the precision and recall of IR are higher for our approach and for the NLM search engine, because both are specialized in information related only to medicine and diabetes diets.
As illustrated in Fig. 3, the precision of mapping users to the relevant information can be achieved by enhancing existing IR systems and applications with a focus on domain specialization. Specialized domain construction is a step toward constructing specialized knowledge domains to address the ever-growing available information and the demand for such information.
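The reported precision and recall rest on the standard set-based definitions, which are easy to restate in code (a sketch, not the Lemur/TREC tooling itself):

```python
def precision_recall(retrieved, relevant):
    """Standard IR measures over result sets:
    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    return p, r
```

Applied per search engine over the 35 retrieved pages against the human-judged relevant set, these two numbers yield the comparison plotted in Fig. 3.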
Fig. 3. Precision and Recall of Specialized Domain Construction with Other Search Engines.
TABLE 5
TERM EXTRACTION USING VARIOUS TECHNIQUES AND THE DOMAIN CONSTRUCTION MODEL.
C. Study Case I: Evaluating IR Performance Using Static Data
This study provides a comparable evaluation of our approach to constructing specialized ontology domains using a static dataset. We evaluated the precision and recall of IR first using the original standard TREC-9 (OHSUMED) data, and later after the OHSUMED dataset was semantically enriched using the proposed approach to building specialized domain ontologies. In both cases, the queries were set to be interrelated with a particular and unique domain. Since the OHSUMED dataset consists of a collection of 348,566 documents [33] of medical information (titles and/or abstracts) from MEDLINE [33], we applied queries related to one of the important medical domains; in this case, the diabetic domain was selected. The TREC 1987-1991 dataset was used along with its topics, queries, and TREC evaluation techniques. In this study, we adopted the evaluation metrics used in TREC-9 and integrated our approach with the Lemur 4.1 Toolkit [32] to index and retrieve documents from the OHSUMED dataset. The queries were extracted from the OHSUMED query set using the cosine similarity function (the TF-IDF algorithm), to be applied later for evaluating the ontology domains and retrieving related information. Terms and keywords related to the diabetic domain were used to extract the queries of the domain we intend to build. As a result, we collected 44 queries related to diabetic domain information.
Using the TREC evaluation techniques and the Lemur IR toolkit, the constructed ontology domains were evaluated in this study case using the following methods:
1) Evaluating IR using the original dataset with unenriched queries. We measured the precision and recall of information retrieved using the 44 typical unenriched and unique OHSUMED queries with the original OHSUMED dataset. As illustrated in Fig. 4, the precision and recall of using unenriched queries with the OHSUMED dataset were low.

Fig. 4. Precision and Recall of the TREC-9 (OHSUMED) dataset using unenriched queries.
2) Evaluate IR using semantically enriched queries with
TEREC-9 dataset, In this experiment, the 44 queries were
semantically enriched using WordNet only. We have
measured precision and recall oI inIormation retrieved using
typical RECT-9 dataset including 44 enriched queries oI
Diabetic domain. As illustrated in Fig.5, the precision oI
using enriched queries with TREC-9 (OSHSUMO) dataset
has improved slightly but the total number oI related
document recalled is becoming lower.
3) Evaluating IR using specialized ontology domains. The
ontology domain was created using the static data; in this
case, the diabetic domain was chosen. In addition to
constructing a specialized ontology domain, the query set
used in the previous experiments was further enriched with an
external data source. In this experiment, queries were
enriched using our proposed approach, which combines
WordNet with an LDA topic model so that the
queries are enhanced with external information extracted
from the Internet.
From Fig. 6, it is easy to observe that mixed query
enrichment using the proposed approach with the static dataset
improved the IR precision and recall compared to
the previous methods.
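The mixed enrichment can be sketched as a merge of the two term sources. Training the LDA model itself (e.g. over crawled documents) is out of scope for a short example, so a precomputed top-ranked topic stands in for it here, and the synonym table and topic terms are hypothetical.

```python
# Sketch of the mixed enrichment pipeline: WordNet synonyms plus the
# top-ranked terms of the highest-probability LDA topic. A precomputed
# topic stands in for a trained LDA model.
def mixed_enrich(query, synonyms, topic_terms, k=3):
    """Merge the query with its synonyms and the top-k topic terms."""
    terms = list(query)
    for term in query:
        terms += [s for s in synonyms.get(term, []) if s not in terms]
    terms += [t for t in topic_terms[:k] if t not in terms]
    return terms

# Hypothetical inputs for a diabetes query.
synonyms = {"diabetic": ["diabetes"]}
top_topic = ["insulin", "glucose", "diabetes", "pancreas"]  # top LDA topic
query = mixed_enrich(["diabetic", "diet"], synonyms, top_topic)
# with k=3 only the first three topic terms are considered
```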
D. Study Case II: Evaluating IR performance using
dynamic data
For further evaluation of our approach, this study case
evaluates the proposed approach using dynamic data (i.e.,
the Internet) to construct the specialized domain, and TREC to
evaluate the precision and recall of IR. The evaluation of
IR precision and recall is based on building the specialized
ontology domains on dynamic data. We adopted the
Internet as the information source of the dynamic data; the
dynamic data was retrieved (crawled) from the
Internet. To be consistent with our evaluation, we have
adopted the TREC-9 evaluation techniques (tfidf, okapi, kljm,
klabs, and twostage) [31], the Lemur IR application, as well as
the same knowledge domain, in this case diabetic.
Fig. 5. Precision and recall of the TREC-9 (OHSUMED) dataset using
queries enriched with WordNet only.
Fig. 6. Precision and recall of the TREC-9 (OHSUMED) dataset using the
intelligent domain model.
The main difference of this study case from study
case I is that both the query set and the dataset were
constructed entirely from the Internet. The specialized
domain ontology and knowledge domain for the
diabetic domain were constructed using our intelligent domain
ontology. The query set of this study was based on real-time
end-user queries extracted from a general-purpose
search engine, AltaVista [34]. The domain-related
documents crawled from the Internet were
added to the original OHSUMED dataset, where they were
indexed and ranked by relevance to the
diabetic synonym terms. The dataset consists of more
than 80,000 documents, and we selected 44 unique related
(diabetic) queries. The Lemur toolkit was used for indexing and
retrieving the related information.
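The augmentation step above can be sketched as follows. In the paper the actual indexing is done by the Lemur toolkit; this example only illustrates ordering crawled documents by how many distinct domain synonym terms they contain, and the documents and term list are made up.

```python
# Sketch: rank crawled documents by overlap with the diabetic synonym
# terms before merging them into the base collection (Lemur does the
# real indexing in the paper; this shows only the scoring/ordering).
def rank_crawled(docs, domain_terms):
    """Order documents by count of distinct domain terms they contain,
    dropping documents with no domain terms at all."""
    terms = set(domain_terms)
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]

# Hypothetical crawled documents and domain term list.
crawled = [
    "glucose and insulin regulation in diabetes",
    "history of the printing press",
    "insulin therapy overview",
]
domain = ["diabetes", "insulin", "glucose"]
ranked = rank_crawled(crawled, domain)
# keeps the two diabetes-related documents, most relevant first
```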
As illustrated in Fig. 7, using dynamic data
(the Internet) to enrich both the queries and the documents
improved the precision of retrieval significantly. Although
we have shown that query enrichment using the ontology-domain
approach significantly improves IR
performance, there is still plenty of room for improvement
if both the information and the query are enriched, as shown in
the second study. This shows that the idea behind
ontology-based information retrieval is to increase the
precision of the retrieval result by taking into account the semantic
information contained in queries and documents, lifting
keywords to ontological concepts and relations.
Fig. 7. Precision and recall of the dynamic dataset using the intelligent
domain model and the TREC-9 evaluation technique.
9. Conclusion
This paper has proposed a novel approach to construct
specialized domain concepts from end users' queries. This
approach enhances existing IR approaches and search
engines by finding the most relevant information. The
general idea is to improve the precision of information
retrieval by mapping users' queries into relevant
information domains. The construction of knowledge
domains, be it manually, using a dictionary, or using
Internet data, is becoming an increasingly difficult task
with the rapid growth of available information. As a result,
an automatic and intelligent approach is an ideal solution for
constructing specialized domain ontologies. Our proposed
solution combines semantic lexical information with
semantic topics extracted from the Internet that are further
evaluated by the end user through reinforcement learning.
Results from the experiments and study cases have
demonstrated the effectiveness of our approach in enhancing
existing IR systems and applications such as search
engines. For future work, we are extending the model to
build specialized knowledge bases. In addition, we are
investigating the possibility of combining social tagging
with reinforcement learning (RL) agents in the system
design to improve the mapping of information through
users' actions and behaviors.
References
[1] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation,"
in NIPS, 2001, pp. 601-608.
[2] S. Deerwester, S. T. Dumais, G. W. Furnas, and T. K.
Landauer, "Indexing by latent semantic analysis," Journal
of the American Society for Information Science, vol. 41,
1990, pp. 391-407.
[3] D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, and
Q. Yang, "Query enrichment for web-query
classification," ACM Trans. Inf. Syst., vol. 24, 2006, pp.
320-352.
[4] J.-H. Yeh and N. Yang, "Ontology construction
based on latent topic extraction in a digital library,"
Lecture Notes in Computer Science, vol. 5362,
Springer Berlin/Heidelberg, 2008, pp. 93-103.
[5] J. Zeng, C. Wu, and W. Wang,
"Multigrain hierarchical topic extraction algorithm for text
mining," Expert Systems with Applications, vol. 37,
no. 4, April 2010, pp. 3202-3208.
[6] S. Endrullis, A. Thor, and E. Rahm, "Evaluation of query
generators for entity search engines," Proc. of Intl.
Workshop on Using Search Engine Technology for
Information Management (USETIM), 2009.
[7] Y. Ding, "IR and AI: Using co-occurrence theory to
generate lightweight ontologies," 12th
International Workshop on Database and Expert Systems
Applications (DEXA), 2001, p. 961.
[8] T. Griffiths, M. Steyvers, and J. Tenenbaum, "Topics in
semantic representation," Psychological Review, vol. 114,
2007, pp. 211-244.
[9] E. M. Voorhees, "Natural language processing and
information retrieval," Information Extraction: Towards
Scalable, Adaptable Systems, Springer-Verlag, 1999.
[10] A. Singhal, "Modern information retrieval: A brief
overview," Bulletin of the IEEE Computer Society
Technical Committee on Data Engineering, vol. 24, no.
4, 2001, pp. 35-42.
[11] R. A. Baeza-Yates and B. Ribeiro-Neto,
Modern Information Retrieval, Addison-Wesley
Longman Publishing Co., Inc., 1999.
[12] W. B. Cavnar and J. M. Trenkle, "N-gram-based text
categorization," Proceedings of the Third Symposium on
Document Analysis and Information Retrieval, Las Vegas,
NV, UNLV Publications/Reprographics, 1994, pp.
161-175.
[13] E. Greengrass, "Information retrieval: A survey," DOD
Technical Report TR-R52-008-001, 2001.
[14] G. A. Miller, "WordNet: A lexical database for
English," Princeton University, Princeton, NJ, 1993.
[15] G. M. Rama, S. Sarkar, and K. Heafield,
"Mining business topics in source code using latent
Dirichlet allocation," ISEC 2008, pp. 113-120.
[16] X. Wei and W. B. Croft, "LDA-based document models for
ad-hoc retrieval," Proc. of ACM SIGIR, 2006.
[17] G. Karypis, "CLUTO: A clustering toolkit," Technical
Report 02-017, Dept. of Computer Science, University of
Minnesota, 2002.
[18] S.-L. Chuang and L.-F. Chien, "Enriching web
taxonomies through subject categorization of query terms
from search engine logs," Decision Support Systems, vol. 35,
Elsevier Science Publishers B.V., 2003.
[19] I. H. Witten, G. W. Paynter, E. Frank, C.
Gutwin, and C. G. Nevill-Manning, "KEA: Practical
automatic keyphrase extraction," Proceedings of the Fourth
ACM Conference on Digital Libraries, 1999, pp. 254-255.
[20] N. Reiter and P. Buitelaar, "Lexical enrichment of a
human anatomy ontology using WordNet," in
Proceedings of the 4th Global WordNet Conference,
Szeged, 2008.
[21] K. Tian, M. Revelle, and D. Poshyvanyk,
"Using latent Dirichlet allocation for automatic
categorization of software," MSR '09: Proceedings of the
2009 6th IEEE International Working Conference on
Mining Software Repositories, IEEE Computer Society,
2009.
[22] C. Watkins and P. Dayan, "Q-learning,"
Machine Learning, vol. 8, 1992, pp. 279-292.
[23] P. Dayan and T. J. Sejnowski, "TD(λ)
converges with probability 1," Machine Learning, vol. 14,
1994, pp. 295-301.
[24] P. Dayan and C. J. C. H. Watkins, "Reinforcement learning,"
Encyclopedia of Cognitive Science, London, England:
MacMillan Press, 2001.
[25] J. Rennie and A. K. McCallum, "Using
reinforcement learning to spider the web efficiently,"
ICML-99 Workshop on Machine Learning in Text Data
Analysis, 1999.
[26] L. Kaelbling, M. Littman, and A.
Moore, "Reinforcement learning: A survey," Journal of
Artificial Intelligence Research, vol. 4, 1996, pp. 237-285.
[27] R. Sutton and A. Barto, Reinforcement
Learning: An Introduction, MIT Press, 1998.
[28] C. Cattuto, D. Benz, A. Hotho, and G.
Stumme, "Semantic analysis of tag similarity measures
in collaborative tagging systems," in Proc. LWA, 2008,
pp. 18-26.
[29] S. Robertson, "How Okapi came to TREC," in E. M.
Voorhees and D. K. Harman (eds.), TREC: Experiment
and Evaluation in Information Retrieval, pp. 287-299,
MIT Press, 2005.
[30] E. M. Voorhees and D. K. Harman, editors, TREC:
Experiment and Evaluation in Information Retrieval,
MIT Press, 2005.
[31] J. Allan, M. Connell, W. B. Croft, F. F. Feng, D. Fisher, and X.
Li, "INQUERY and TREC-9," in TREC-9, 2000, pp. 504-
513.
[32] Lemur Toolkit. http://www.lemurproject.org/.
[33] W. R. Hersh and D. H. Hickam, "Use of a multi-application
computer workstation in a clinical setting," Bulletin of
the Medical Library Association, vol. 82, 1994, pp. 382-389.
[34] C. Silverstein, M. Henzinger, H. Marais, and
M. Moricz, "Analysis of a very large AltaVista
query log," Technical Report 1998-014, Systems
Research Center, Digital Equipment Corporation, Palo
Alto, California, October 1998.
[35] A. Mooman, O. Basir, and A. Younes, "An intelligent
model to construct specialized domain ontologies," 4th
IEEE International Conference on Computer Science and
Information Technology (ICCSIT 2011), June 2011.