An Approach to Build Ontology in a Semi-Automated Way
Jaytrilok Choudhary and Devshri Roy
Abstract—Ontology is used to represent the knowledge of a domain in a formal, machine-understandable form in areas such as intelligent information processing. It thus provides a platform for effective information extraction and many other applications. Ontologies are generally developed manually. Manual ontology building requires a great deal of effort from domain experts and is therefore time-consuming and costly. To reduce this effort, we have explored the feasibility of automatic ontology building. In this paper, we propose a methodology for building an ontology in a semi-automatic manner. Algorithms are developed for the automatic extraction of concepts, and relationships among the concepts are assigned in a semi-automated manner. The experimental results show a fair degree of accuracy, which may be improved in future with more sophisticated algorithms.
Index Terms—Domain Concepts; Domain Ontology Building; Relationship among Concepts.
1 INTRODUCTION
Ontology has emerged as a very important discipline in the area of knowledge engineering. It is used for knowledge representation, organisation and acquisition, and it is very useful for expressing and sharing the knowledge of the Semantic Web.
There are various ways to define what an ontology is:
1. From the Artificial Intelligence point of view, an ontology is defined as an "explicit specification of conceptualization". Conceptualization is the abstract representation of a real-world entity with the help of domain-relevant concepts.
2. From the knowledge-based systems point of view, it is defined as "a theory of concepts/vocabulary used as building blocks of an information processing system".
3. A compositional definition of ontology is "an ontology is a hierarchical organization of concepts along with relationship between them".
Ontologies can broadly be classified into three types: Natural Language Ontologies, Domain Ontologies and Ontology Instances. Natural language ontologies contain lexical relations between language concepts; they are large in size and do not require frequent updates. They usually represent background knowledge. Domain ontologies capture the knowledge of one particular domain, for example a pharmacological ontology or a protein ontology. They provide a detailed description of the concepts of a restricted domain and are therefore sometimes referred to as 'vertical' ontologies. Ontology instances represent the main piece of knowledge presented in the Semantic Web. They will serve as the Web pages and will contain links to other instances, similar to the links between Web pages.
An ontology should be formal so that it is machine understandable and enables knowledge to be shared across communities. Ontology management tools provide the facilities and environments to build a new ontology. Generally, ontologies are constructed manually using such tools. The manual approach has many problems, some of which are listed below:
• Manual ontology construction requires a great deal of human effort and is therefore time-consuming and costly.
• It is a difficult task because correct installation requires adequate domain knowledge.
• An efficient ontology can be built only by a domain expert. Experts sometimes find the task of manually assigning relationships between a large number of concepts uninteresting and are reluctant to do the work satisfactorily, which reduces the precision of the ontology.
To overcome the problems of manual ontology building, researchers are working on semi-automatic and automatic ontology building approaches. Various approaches have been proposed to build ontologies in automated and semi-automated ways. However, more work has to be done to develop ontologies automatically with good accuracy. Moreover, the World Wide Web is an ever-increasing database of structured and unstructured data from various domains. There is a need to use this information to develop domain ontologies automatically.

• Jaytrilok Choudhary is with the Department of Computer Science, Maulana Azad National Institute of Technology, Bhopal, India.
• Devshri Roy is with the Department of Computer Science, Maulana Azad National Institute of Technology, Bhopal, India.
www.jict.co.uk © 2012 JICT
The basic idea motivating our work is to use the information available on the Web to develop a domain ontology in a semi-automatic manner. The work presented in this paper focuses on the following aspects:
1. Automatic extraction of domain concepts from the Web for building a domain ontology.
2. Assignment of relationships between concepts in a semi-automated manner.
The rest of the paper is organized as follows. First we discuss the previous work on ontology building. The knowledge model we propose for the development of the domain ontology is presented in Section 3. Section 4 discusses the ontology building methodology. The ontology learning algorithm is presented in Section 5. In Section 6, we present the experimental results. Finally, we conclude our work.

2 RELATED WORK
Various methods have been proposed for ontology building.
Mei-ying Jia et al. [4] proposed an automated ontology construction method. The method is not purely automated: it uses an existing thesaurus and database of military intelligence. The thesaurus provides the class information for the ontology and the database provides the instances. Only three types of relationships are used between the concepts of the constructed ontology. Finally, Protégé, an open-source editing tool that provides a friendly user interface, is used to represent the ontology.
H. Kong et al. [5] gave a methodology for building an ontology automatically, based on a frame ontology from WordNet concepts and existing knowledge data. The method is divided into two parts. The first part makes it possible to build the ontology automatically from the frame ontology of WordNet concepts, which are standard structured knowledge data. The second part makes the ontology more complete using specific input data provided by domain experts. The method is not totally automatic: the core concepts are first taken from WordNet, and the relationships between concepts are limited to those in WordNet.
J. Wang et al. [6] used rule-based information extraction (IE) as a method to learn ontology instances. It automatically extracts the wanted factors of the instances with the help of the definitions in the domain ontology. A key technique for the use of IE is rule generation. They put forward a rule generation algorithm, RGA-CIE, which applies supervised learning with a bottom-up strategy, uses a heuristic method to decide the rule generalization path, and uses a Laplacian formula (1) to evaluate the performance of rules, where n is the number of extractions made on the training set, es is the number of substitutions and ei is the number of insertions.
Q. Yang et al. [7] present an ontology learning method which combines personalized recommendation with concept extraction and a stable domain concept extraction method. The method uses machine learning for the extraction of field concepts, and recommendation study is used for domain concept extraction. It largely improves the accuracy and the stability of concept extraction. There are still many issues to be resolved in the field of ontology learning, such as relationship learning.
Wu Yuhuang et al. [8] propose a Web-based ontology learning model. The approach concerns the automatic extraction of an ontology from Web pages and the discovery of the patterns and relations of the ontology's semantic concepts from Web page data. It semi-automatically extracts the Web ontology by analysing a collection of Web pages from the same application domain.
Wen Zhou et al. [9] proposed a semi-automatic technique that starts from a small core ontology constructed by domain experts and automatically learns concepts and relations, using the general ontology WordNet and event-based natural language processing technologies, to construct the domain ontology. Relation learning is based on event extraction, which finds the verb relations between concepts. The method relies fully on WordNet to discover relationships between concepts.

3 KNOWLEDGE REPRESENTATION MODEL
The developed knowledge representation database consists of domain concepts and domain-dependent inter-concept relationships. Concepts are automatically extracted from Web pages. We notice that if a concept is significant in a document, then that document usually contains a large number of references to related concepts. A set of relationships among the concepts is obtained by natural language processing of the text. The concepts in the domain are organized into a directed graph (di-graph), as shown in Figure 1. An edge between two concepts in the di-graph indicates that the concepts are related. Each edge is assigned a weight depending upon the distance by which the two concepts are related to each other; the weight indicates the strength of the relationship. At present, a set of five relations is defined among the concepts: Is_a, Is_a_Part_of (or has), Is_a_Kind_of, Is_Operated_by and Is_an_Example_of.

[Figure 1: di-graph over the concepts Computer, Hardware, Software, Firmware and Operating System, with edges labelled Is_operated_by, has, Is_a_part_of, Is_a_Kind_of, Is_a and Operates]
Fig. 1. Domain Knowledge Representation
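As an illustration of the knowledge representation model of Section 3, the following minimal sketch stores concepts in a weighted, labelled di-graph. The class name `ConceptGraph`, its methods, the helper `build_example_graph` and the numeric weights are all hypothetical; the paper does not say how weights are computed, and the sketch assumes that a larger weight means a stronger relationship.

```python
from dataclasses import dataclass, field

# The five relation names listed in Section 3.
RELATIONS = {
    "Is_a", "Is_a_Part_of", "Is_a_Kind_of",
    "Is_Operated_by", "Is_an_Example_of",
}


@dataclass
class ConceptGraph:
    """Directed graph: concept -> list of (target, relation, weight)."""
    edges: dict = field(default_factory=dict)

    def add_relation(self, source, target, relation, weight=1.0):
        """Add a weighted, labelled edge from source to target."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges.setdefault(source, []).append((target, relation, weight))
        self.edges.setdefault(target, [])  # the target is a node as well

    def related(self, concept):
        """Return (target, relation) pairs, strongest edge first."""
        ranked = sorted(self.edges.get(concept, []), key=lambda e: -e[2])
        return [(target, relation) for target, relation, _ in ranked]


def build_example_graph():
    """Build a tiny graph from three of the relations shown in Figure 1."""
    g = ConceptGraph()
    # The weights are made-up values for illustration only.
    g.add_relation("Computer", "Software", "Is_Operated_by", 0.9)
    g.add_relation("Firmware", "Computer", "Is_a_Part_of", 0.6)
    g.add_relation("Firmware", "Software", "Is_a_Kind_of", 0.8)
    return g
```

Calling `build_example_graph().related("Firmware")` lists the concepts reachable from Firmware together with the relation on each edge. An adjacency dictionary is used here only because it keeps the sketch short; a relational table such as Table 1 carries the same information.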
4 ONTOLOGY BUILDING METHOD
The method used to build the ontology is shown in Figure 2. Web documents of the computer domain are collected from the dmoz Open Directory [11]. The documents are preprocessed to extract domain-specific concepts, and stemming is applied to obtain root words. The concepts are stored in the database, and a set of relationships is assigned between them. The steps of ontology building are described below.
[Figure 2: flow diagram with the stages Domain Corpus, Stemming, Concept Generation, Domain Concepts, Defining Relationship among Concepts, Domain Ontology, Storage of Domain Knowledge and Ontology Representation]
Fig. 2. Ontology Building Process

4.1 Document Collection from Large Corpus
We have used the large corpus of the dmoz Open Directory Project, which is the most comprehensive human-edited directory of the World Wide Web [11]. It contains information about various domains such as art, business, computers and sports, and it is constructed and maintained by a vast, global community of volunteer editors. We have mainly focused on the computer domain to build the ontology.

4.2 Pre-processing of Documents and Extraction of Concepts
The preprocessing phase of the ontology building method is divided into two parts: RDF (Resource Description Framework) parsing and stemming.
RDF Parsing: The selected corpus is in RDF file format, so we first need to parse it to collect the domain-oriented corpus and concepts. For example, all the domains are attached under the Top label in the RDF file as follows:

<Topic r:id="Top">
  <catid>2</catid>
  <d:Title>Top</d:Title>
  <lastUpdate>2010-02-16 08:43:34</lastUpdate>
  <d:Description></d:Description>
  <narrow r:resource="Top/News"></narrow>
  <narrow r:resource="Top/Science"></narrow>
  <narrow r:resource="Top/Business"></narrow>
  <narrow r:resource="Top/Health"></narrow>
  <narrow r:resource="Top/Computers"></narrow>
  <narrow r:resource="Top/Sports"></narrow>
  <narrow r:resource="Top/Arts"></narrow>
  ……
</Topic>

In the first step of RDF parsing, only the domain-oriented RDF structure is extracted from the RDF file. If we choose Computer as the working domain, then all the Computer-oriented RDF structures are extracted:

<Target><related
  <"Top/Computers/Mobile_Computing/Wireless_Data/Software"/>
  <"Top/Computers/Internet/Protocols/">
  <"Top/Computers/Data_Communications/Telephony">
  <"Top/Computers/Data_Communications/Wireless">
  <"Top/Computers/Data_Communications/Wireless">
  <"Top/Computer/Software/Manufacturing/Automation">
  <"Top/Computers/Software/ERP">
  <"Top/Computers/Software/Graphics/Color_Management">
  <"Top/Computers/Graphics/Fonts"/>
  <"Top/Computers/Companies/Product_Support"/>
  ……
</related></Target>

In the second step, all domain-oriented concepts (here, for Computer) are extracted, for example Computer, Computer hardware, Firmware, Software and so on. After this step, all the domain-oriented topics, subtopics and concepts have been extracted from the above RDF structure.
Stemming: Words that are small syntactic variants of one another may share the same word stem. It is therefore useful for the ontology learning system to identify such groups of words and collect only the root word stem of each group. For example, the words computation, computing and computes share the common word stem compute and must be viewed as different occurrences of the same word.
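The two preprocessing parts of Section 4.2, RDF parsing and stemming, can be sketched roughly as below. This is a simplified illustration and not the authors' implementation: the sample data is a cut-down imitation of the dmoz structure, the namespace URIs are placeholders, the function names are hypothetical, and the suffix-stripping stemmer is a toy stand-in for a real stemming algorithm (it groups the paper's example words under the key `comput` rather than `compute`).

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions; not taken from the paper).
R_NS = "http://example.org/rdf#"
D_NS = "http://example.org/dc#"

SAMPLE = f"""<RDF xmlns:r="{R_NS}" xmlns:d="{D_NS}">
  <Topic r:id="Top">
    <d:Title>Top</d:Title>
    <narrow r:resource="Top/Science"></narrow>
    <narrow r:resource="Top/Computers"></narrow>
  </Topic>
  <Topic r:id="Top/Computers">
    <d:Title>Computers</d:Title>
    <narrow r:resource="Top/Computers/Software"></narrow>
    <narrow r:resource="Top/Computers/Hardware"></narrow>
    <narrow r:resource="Top/Computers/Data_Communications/Wireless"></narrow>
  </Topic>
</RDF>"""


def domain_paths(rdf_text, domain):
    """Step 1: keep only the resource paths that lie under Top/<domain>."""
    root = ET.fromstring(rdf_text)
    prefix = f"Top/{domain}"
    attr = f"{{{R_NS}}}resource"  # namespaced attribute key used by ElementTree
    paths = []
    for element in root.iter("narrow"):
        path = element.get(attr, "")
        if path == prefix or path.startswith(prefix + "/"):
            paths.append(path)
    return paths


def concepts_from_paths(paths):
    """Step 2: turn path segments into concept names, without duplicates."""
    seen, concepts = set(), []
    for path in paths:
        for segment in path.split("/")[1:]:  # drop the leading "Top"
            name = segment.replace("_", " ")
            if name not in seen:
                seen.add(name)
                concepts.append(name)
    return concepts


SUFFIXES = ("ation", "ing", "es", "s", "e")  # toy rule set, checked in order


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w
```

For example, `concepts_from_paths(domain_paths(SAMPLE, "Computers"))` yields the topic and subtopic names found under the Computers branch, and `stem` maps computation, computing and computes to one shared key so that they count as the same concept.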
4.3 Defining Relationship among Concepts
Defining relationships is a very difficult and challenging part of the ontology building process. Most ontologies keep very few relationships among concepts, such as 'Is-a' and 'Part-of'. However, the relationship list should be broad enough to cover the most important forms so that the ontology can be used widely by different applications. Moreover, the list of relationships differs from domain to domain, and the types of relationships are decided according to the requirements of the applications. Presently, we are working on the Computer domain, and the major relationships included in the developed ontology are as follows:
1. Is_a: The 'Is_a' relationship indicates a generalization/specialization relationship between two concepts. E.g. Computer Is_a hardware.
2. Is_a_Part_of: The 'Is_a_Part_of' relationship indicates that a concept comprises two or more sub-concepts. E.g. Memory Is_a_Part_of computer system.
3. Is_a_Kind_of: The 'Is_a_Kind_of' relationship shows behavioural similarity between two concepts. E.g. Firmware Is_a_Kind_of software.
4. Is_Operated_by or operates: The 'Is_Operated_by' or 'operates' relationship shows the operational behaviour between two concepts. E.g. software operates computer, or computer Is_Operated_by software.
5. Is_an_Example_of: The 'Is_an_Example_of' relationship shows that one concept is an example of another concept. E.g. RAM Is_an_Example_of memory device.
Concepts along with relationships are stored in the database as given in Table 1.

TABLE 1
RELATIONSHIP AMONG CONCEPTS
Concept1    Concept2    Relationship
Computer    Software    Is_Operated_by
Computer    Hardware    Has
Firmware    Computer    Is_a_Part_of
Firmware    Software    Is_a_Kind_of
Firmware    Hardware    Is_a_Kind_of
…           …           …

5 RELATIONSHIP LEARNING ALGORITHM
Our main goal is to assign relationships between concepts automatically. To extract the relationships, the collected documents are processed: sentences are parsed and their semantics are learned using natural language processing techniques. Different rules are formed to assign different relationships between concepts. The relationship learning algorithm is as follows:
Step 1: Process the document and extract sentences one at a time; ignore sentences that do not contain two nouns.
Step 2: Check that at least one of the nouns is a concept. If one noun (concept) is already present in the ontology, add the second noun (concept) to the ontology; if both nouns are present, try to find the relationship between these two nouns (concepts) from the sentence.
Step 3: Let C1 and C2 be the two concepts extracted in Step 2. Preserve their order and all the words between them.
Step 4: Infer the relationship (R) with the help of the verb, preposition and adjective words that occur along with these concepts.
Step 5: Add the relationship between the two concepts (C1, C2, R) to the ontology.

6 EXPERIMENTAL RESULTS
We have developed an ontology specific to the computer domain. Concepts are automatically extracted from the dmoz Open Directory. The precision and recall measures are widely used in the field of information extraction to evaluate the effectiveness of domain concept extraction:
$$\text{Precision} = \frac{\text{AConcept}}{\text{TConcept}} \qquad (2)$$
$$\text{Recall} = \frac{\text{AConcept}}{\text{DConcept}} \qquad (3)$$
where
AConcept = total number of concepts extracted accurately,
DConcept = total number of domain-specific concepts,
TConcept = total number of concepts.
There are 500 concepts in total, of which 400 are specific to the computer domain. Our algorithm returns 370 concepts accurately. We thus obtain a precision of 74% (370/500) and a recall of 92% (370/400).
For relationship learning, we processed large documents of about 2000 sentences. 60 sentences were extracted in which one of the nouns (concepts) already existed in the domain ontology. From these sentences the other nouns were extracted and added to the ontology. The ontology learning algorithm also extracted 44 sentences containing two nouns that both already existed in the domain ontology. In total, 104 sentences were successfully extracted and 60 new concepts were added along with their relationships. Adding these 60 concepts manually would take a lot of time, since the domain corpus would have to be processed by hand to find the new concepts.
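The five steps of the relationship learning algorithm in Section 5 can be sketched as follows. This is only a sketch under stated assumptions: the input is assumed to be already tokenised and part-of-speech tagged (the tag names are hypothetical), the cue phrases in `RULES` are invented for illustration because the paper does not list its rules, and the words kind, part and example are excluded from the noun candidates so that they can act as cues.

```python
# Hypothetical cue phrases -> relation names; more specific cues come first.
RULES = [
    ("is a kind of", "Is_a_Kind_of"),
    ("is a part of", "Is_a_Part_of"),
    ("is an example of", "Is_an_Example_of"),
    ("is operated by", "Is_Operated_by"),
    ("is a", "Is_a"),
]
CUE_NOUNS = {"kind", "part", "example"}  # nouns used as cues, not as concepts

# Example input: each sentence is a list of (word, tag) pairs; "N" marks a noun.
SENTENCES = [
    [("Firmware", "N"), ("is", "V"), ("a", "DET"), ("kind", "N"),
     ("of", "P"), ("software", "N")],
    [("Computer", "N"), ("is", "V"), ("operated", "V"), ("by", "P"),
     ("software", "N")],
    [("The", "DET"), ("keyboard", "N"), ("is", "V"), ("on", "P"),
     ("the", "DET"), ("desk", "N")],
    [("It", "PRON"), ("runs", "V"), ("quickly", "ADV")],
]


def infer_relation(between):
    """Step 4: match whole-word cue phrases in the words between C1 and C2."""
    phrase = " " + " ".join(word.lower() for word in between) + " "
    for cue, relation in RULES:
        if f" {cue} " in phrase:
            return relation
    return None


def learn_relations(tagged_sentences, concepts):
    """Return the enlarged concept set and the learned (C1, C2, R) triples."""
    concepts = set(concepts)
    triples = []
    for sentence in tagged_sentences:
        # Step 1: skip sentences with fewer than two candidate nouns.
        nouns = [i for i, (word, tag) in enumerate(sentence)
                 if tag == "N" and word.lower() not in CUE_NOUNS]
        if len(nouns) < 2:
            continue
        # Step 3: take the first two nouns in their original order.
        i, j = nouns[0], nouns[1]
        c1, c2 = sentence[i][0].lower(), sentence[j][0].lower()
        # Step 2: at least one noun must already be a known concept.
        if c1 not in concepts and c2 not in concepts:
            continue
        concepts.update((c1, c2))
        relation = infer_relation([word for word, _ in sentence[i + 1:j]])
        # Step 5: record the triple when a relation could be inferred.
        if relation is not None:
            triples.append((c1, c2, relation))
    return concepts, triples
```

Running `learn_relations(SENTENCES, {"computer", "software"})` adds firmware as a new concept and records one triple for each of the first two sentences, while the last two sentences are skipped. In practice the tagged input would come from a natural language processing toolkit, and the rule table would be much richer than these five cues.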
7 CONCLUSION AND FUTURE WORK
In this paper, we have proposed a framework to build an ontology in a semi-automated way. To build the ontology, domain-oriented concepts are automatically extracted from the large dmoz Open Directory corpus, and the relationships among these concepts are defined in a semi-automated way. The set of relationships stored between the concepts is deliberately broad so that the developed ontology can be used by various applications. The framework also provides a way to display the ontology as a graph structure so that users can understand the complete ontology at a glance and use it according to their requirements. Future work includes introducing a larger number of diversified relationships between concepts and extracting them automatically to make the developed ontology more useful.
REFERENCES
[1] T. R. Gruber, "A Translation Approach to Portable Ontology Specifications", Knowledge Acquisition, 5, Academic Press Ltd., pp. 199–220, 1993.
[2] R. Mizoguchi, J. Vanwelkenhuysen, M. Ikeda, "Task ontology for reuse of problem solving knowledge", in Proc. of Towards Very Large Knowledge Bases: Knowledge Building & Knowledge Sharing, 1995.
[3] Borys Omelayenko, "Learning of Ontologies for the Web: the Analysis of Existent Approaches", in Proceedings of the International Workshop on Web Dynamics, held in conj. with the 8th International Conference on Database Theory (ICDT'01), 2001.
[4] M. Jia, B. Yang, D. Zheng, W. Sun, Li Liu, Jing Yang, "Automatic Ontology Construction Approaches and Its Application on Military Intelligence", Asia-Pacific Conference on Information Processing (APCIP), vol. 2, pp. 348–351, 2009.
[5] H. Kong, M. Hwang and P. Kim, "Design of the automatic ontology building system about the specific domain knowledge", 8th International Conference on Advanced Communication Technology (ICACT), 2006.
[6] J. Wang, C. Wang, J. Liu and C. Wu, "Information Extraction for learning of Ontology Instances", IEEE International Conference on Industrial Informatics, 2006.
[7] Qing Yang, Kai-min Cai, Jun-li Sun, Yan Li, "Design Analysis and Implementation for Ontology Learning Model", 2nd International Conference on Computer Engineering and Technology, 2010.
[8] Wu Yuhuang, Li Yusheng, "Design and Realization for Ontology Learning Model Based on Web", IEEE International Conference on Information Technology and Computer Science, 2009.
[9] Wen Zhou, Zongtian Liu, Yan Zhao, Libin Xu, Guang Chen, Qiang Wu, Mei-li Huang, Yu Qiang, "A Semi-automatic Ontology Learning Based on WordNet and Event-based Natural Language Processing", ICIA, 2006.
[10] P. K. Bhowmick, Devshri Roy, Sudeshna Sarkar, Anupam Basu, "A Framework for Manual Ontology Engineering for Management of Learning Material Repository", International Journal of Computer Science and Applications, Vol. 7, No. 2, pp. 30–51, 2010.
[11] R. Skrenta, B. Truel, "Dmoz Open Directory", http://www.dmoz.org, 1998.
Jaytrilok Choudhary obtained his Bachelor of Engineering in 2005 from Rajeev Gandhi Technical University, Bhopal, and his Master of Technology in 2009 from IIT Roorkee, and is currently pursuing a Ph.D. at MANIT Bhopal. He is presently working as an Assistant Professor at Maulana Azad National Institute of Technology (MANIT), Bhopal.
Dr. Devshri Roy obtained her Bachelor of Engineering in 1990, Master of Engineering in 1998 and Ph.D. in 2007 from the Indian Institute of Technology, Kharagpur, India. She has worked with many prestigious institutes of India, such as the Indian Institute of Technology and the National Institute of Technology. She is currently working as an Associate Professor at Maulana Azad National Institute of Technology, Bhopal. She has published a total of 25 papers in refereed journals, international conferences and international workshops. A research grant worth Rupees 9.73 lakhs has been given by the Government of India to carry out a research project. Her current research interests include information retrieval, natural language processing and the application of Artificial Engineering techniques in electronic and mobile learning.