Design of Content-Oriented Information Retrieval by Semantic Analysis

Published by: ijcsis on Feb 15, 2011
Copyright:Attribution Non-commercial

DESIGN OF CONTENT ORIENTED INFORMATION RETRIEVAL BASED ON SEMANTIC ANALYSIS

S. Amudaria, PG Student, Dept of IT, SSN College of Engineering, Chennai, India, daria.amu@gmail.com
S. Sasirekha, Asst. Professor, Dept of IT, SSN College of Engineering, Chennai, India, sasirekhas@ssn.edu.in
Abstract: Existing Information Retrieval (IR) systems that are based entirely on syntactic (keyword-based) content have serious limitations, such as irrelevant document retrieval, word sense ambiguity, and low precision and recall, because the complete semantics of the content are not represented. Recent literature indicates that overcoming these limitations requires analyzing and determining the semantic features of both the document content and the query. This paper therefore proposes first to develop a semantic pattern that represents the semantic features of the content of every document in the corpus in Term Document Matrix (TDM) format. A semantic pattern is then developed for the content of the query by combining Natural Language Processing techniques with Synsets (WordNet) for query refinement and expansion. The similarity between the semantic pattern of the query and the TDM is calculated using Latent Semantic Analysis (LSA) and plotted in a semantic vector space. By matching against the vector space, content associated with the query can be identified in the corresponding cluster. Experimental results show an increase in document retrieval recall and precision rates, thereby demonstrating the effectiveness of the model.
Keywords: Information retrieval, Semantic extraction, Query extension, Query matching
I INTRODUCTION
Existing information retrieval systems are mostly keyword-based and identify relevant documents or information by matching keywords. Keyword-based search, in spite of its merits of expedient querying and ease of use, fails to represent the complete semantics contained in the content (Oh et al, 2007) and has led to the following problems (Abdelali et al, 2007; Moreale et al, 2004): (1) Keywords represent only fragmented meanings of the content, and the content identified through keywords does not always meet the querist's requirements. The querist has to screen the retrieval results and correct the keywords several times to obtain the required information. (2) Compared to a text, a query usually comprises less content, which may lead to wrong retrieval results due to problems such as insufficient information being used in the search process, insufficient query topics, and difficulty in determining query features. (3) Due to synonymy and polysemy in human language, information retrieval through keywords covers only information containing the same keyword, while other information with similar semantics but different keywords is left out completely.

Users normally turn to a search engine expecting exact and relevant results, but current search engines do not guarantee accurate results. Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable data space, whether on the Web or within a closed system, to generate more relevant results. Rather than using ranking algorithms such as Google's PageRank to predict relevancy, semantic search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have the user sort through a list of loosely related keyword results. Here, WordNet is used to obtain the semantics of the query.

A brief literature survey of information retrieval techniques is given in Section II, the proposed system and its techniques are explained in Section III, and the next two sections deal with the implementation and test results.
II RELATED WORK
Ming-Yen Chen et al. (2009) introduce semantic-enabled information retrieval in which a web corpus is taken and the related information is retrieved. The limitation of that project is that it does not deal with synonyms or Synsets. In our project we concentrate on the WordNet ontology to collect more senses.

Zongli Jiang et al. (2009) introduce the concept of the category attribute of a word. According to the category attribute of a word, useless results can be removed from the search results and the retrieval efficiency is improved. Latent semantic analysis is a method that can discover the underlying semantic relation between words and documents. Singular value decomposition is used in latent semantic analysis to analyze the words and documents and finally obtain the semantic relation.

Hongwei Yang et al. (2010) describe an approach that enables users to find the relevant documents more easily and also helps users to form an understanding of the different facets of the query that have been provided to a web search engine. A popular technique for clustering is based on K-means, in which the data is partitioned into K clusters. In this method, the groups are identified by a set of points called the cluster centers. Each data point belongs to the cluster whose center is closest. The algorithm used in the proposed system is the K-means clustering algorithm.

Gang et al. (2009) proposed a method to enhance information retrieval recall and precision. To filter out the documents that have a smaller degree of relatedness to the original query, the scores of the search result documents are recalculated using ontology semantic similarity. A new definition of the iterative query expansion parameters is put forward, which can reduce the number of expansions and further improve the efficiency of the query.

Trong Hai et al. (2008) proposed a system that applies the relations between entities discovered from a text corpus to ontology integration tasks, in which the noun phrase (NP) is used to identify its head noun; this is useful for avoiding wrong relations between entities. It also proposes a collaborative acquisition algorithm combining WordNet and a text corpus to provide general concepts and their relations for ontology integration tasks.

Trong Hai Duong and Geun Sik Jo (2009) proposed a new measure based on the WordNet semantic ontology database, which combines an information content-based measure with edge-counting techniques to measure semantic similarity. The influence of the "PART-OF" and "IS-A" hierarchical relations on the semantic similarity is considered in that paper. Breadth-first search is used to find the shortest path between two concepts. The similarities of hierarchy and superposition are calculated separately.

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011, http://sites.google.com/site/ijcsis/, ISSN 1947-5500
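The K-means technique described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `kmeans`, the explicit initial centers, and the sample points are assumptions chosen only to show the two alternating steps (assign each point to the closest center, then move each center to the mean of its points).

```python
import math


def kmeans(points, centers, max_iter=100):
    """Partition points into len(centers) clusters.

    The initial centers are passed in explicitly (an assumption of this
    sketch) so that the result is deterministic.
    """
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster whose center is closest.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the clustering has converged
        assignment = new_assignment
        # Update step: move each center to the mean of its assigned points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, final_centers = kmeans(points, centers=[(0, 0), (10, 10)])
```

In a retrieval setting the points would be document vectors in the semantic space rather than 2-D coordinates; the choice of initial centers affects the result in general.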
III PROPOSED SYSTEM
The proposed system uses semantic analysis to retrieve the content that is relevant to the user query. The user's query is analyzed in the semantic extraction and determination module, which extracts its semantic features in order to determine the contents of the query and represent them in a structured and materialized semantic pattern. In this component, the semantic elements are identified and their semantic relations are analyzed, followed by the integration and simplification of the semantic relations with WordNet. The semantic extension module then identifies other potentially relevant semantic features based on the semantic features of the query and includes them in the query pattern. This increases the number of semantic features in the query that serve as the basis for matching.

The input query from the user is preprocessed by removing stop words and then stemming. Every processed word is passed to WordNet to collect all the other senses of that word. The Synsets related to the query are taken, and Latent Semantic Analysis is performed to index the documents.

The term-document frequency matrix is generated from the documents in the corpus and processed with singular value decomposition. The matrix is plotted, and the terms most similar to the query are plotted in the semantic space. Finally, the relevant documents are obtained using k-means clustering. The block diagram of the proposed system is shown below.
Fig1: Block diagram of the proposed system
Consider the word Java. The senses of the word Java taken from WordNet are given below.

Word | Senses
Java | an island in Indonesia south of Borneo; one of the world's most densely populated regions
Java | coffee: a beverage consisting of an infusion of ground coffee beans; "he ordered a cup of coffee"
Java | a simple platform-independent object-oriented programming language used for writing applets that are downloaded from the World Wide Web by a client and run on the client's machine

The single word Java therefore has three senses. This type of word ambiguity is not handled by current search engines. Consider another example, in which the query given by the user is Computer. The words PC and Computer refer to the same thing, but in current search engines only the documents containing the word Computer are indexed and returned to the user. Even though the word PC has the same meaning, the pages relevant to the word PC are not retrieved. Hence the precision and recall ratios are reduced.

The proposed system is a content-based information retrieval design based on semantic analysis, in which the WordNet ontology is used to perform a search based on Synsets and thus to increase the precision and recall ratios.
Precision: What fraction of the returned results is relevant to the information need?
Recall: What fraction of the relevant documents in the collection was returned by the system?
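The two definitions above translate directly into set arithmetic. The following is a minimal sketch; the function name `precision_recall` and the document identifiers are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of returned and relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # returned documents that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 relevant documents exist,
# and 2 of the returned documents are relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
# p is 2/4 and r is 2/3
```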
IV IMPLEMENTATION
The system is implemented using a corpus of 250 documents. A query is given as input, and the processing steps are explained below. The final output is the set of relevant documents that match the query.
Semantic Pattern Construction
Semantic Query Processing
Semantic Query Refinement & Expansion
Semantic Pattern Matching
1. Develop Semantic Pattern from the Content
Developing a semantic pattern from the content requires the following steps. The given content is preprocessed by removing the stop words and applying the Porter stemming algorithm to find the root words. The stop words, which carry no meaning in the content or the query, are specified manually.
i. Content Preprocessing
A content repository of 250 text documents is taken as the corpus. These documents are first split into tokens. A list of selected stop words is prepared; these stop words are discarded by the search engine. All the text documents in the corpus are passed through this stop word list. Any document word that matches an entry in the list is considered a stop word and is eliminated. This step reduces the number of tokens. Each remaining word is considered a keyword and is stored in a text file. Stop words are normally pronouns, articles, and prepositions.
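The preprocessing step above can be sketched as follows. The stop word list, the tokenization rule (lowercase alphabetic runs), and the function name `extract_keywords` are assumptions of this sketch; the paper states only that the stop words are chosen manually.

```python
import re

# Hypothetical, manually chosen stop words (articles, pronouns, prepositions).
STOP_WORDS = {"a", "an", "the", "he", "she", "it", "of", "in", "on", "to", "is", "and"}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Split text into lowercase tokens and drop every token in the stop list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]


keywords = extract_keywords("The cat sat on the mat in the house")
```

The remaining tokens are the keywords that would be written to the keyword text file.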
ii. Porter Stemming Algorithm
After the stop words are removed, the keywords are passed to a stemming algorithm. The stemming algorithm used in this work is the Porter Stemming Algorithm. This component identifies semantic elements such as subject, object, and predicate in the content semantics and analyzes their semantic relations.
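To give a flavour of how suffix stripping works, the sketch below implements only step 1a of the Porter algorithm (plural endings). It is deliberately incomplete: the full algorithm has several further steps with conditions on the word's structure, and the function name `porter_step_1a` is an assumption of this sketch.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter stemmer: SSES -> SS, IES -> I, SS -> SS, S -> ''."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


stems = [porter_step_1a(w) for w in ["caresses", "ponies", "caress", "cats"]]
```

The rules are tried in order and only the first matching rule is applied, which is why "caress" is left unchanged rather than losing its final "s".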
iii. Term Document Matrix
A Term Document Matrix (TDM) is generated to record the occurrences of every keyword in each document. The term-document matrix is a large grid representing every document and content word in a collection. The TDM is generated by arranging the list of all content words along the vertical axis and a similar list of all documents along the horizontal axis. These need not be in any particular order, as long as it is recorded which row and column correspond to which keyword and document.
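A minimal sketch of the matrix construction described above, with terms as rows and documents as columns. The function name `build_tdm`, the sorted ordering, and the three toy documents are assumptions made for illustration; any fixed ordering would do.

```python
def build_tdm(docs):
    """Build a term-document count matrix.

    docs maps a document id to its list of keywords. Returns the row labels
    (terms), the column labels (document ids), and the count matrix.
    """
    doc_ids = sorted(docs)
    terms = sorted({t for keywords in docs.values() for t in keywords})
    row = {t: i for i, t in enumerate(terms)}
    matrix = [[0] * len(doc_ids) for _ in terms]
    for j, doc_id in enumerate(doc_ids):
        for t in docs[doc_id]:
            matrix[row[t]][j] += 1  # one more occurrence of term t in document j
    return terms, doc_ids, matrix


# Hypothetical preprocessed documents.
docs = {
    "d1": ["java", "coffee", "java"],
    "d2": ["java", "island"],
    "d3": ["coffee", "bean"],
}
terms, doc_ids, tdm = build_tdm(docs)
```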
2. Query Refinement and Expansion using WordNet
i. Query Refinement
The query entered by the user is passed through the stop word list to remove the stop words. Stemming is then applied so that only the root form of each term is kept. Each term is passed to WordNet to obtain more senses. For example, the word vomit has three senses: vomits, barf, and puke. In a keyword-based search only the word vomit is used, not its senses. Different words expressing the same meaning are therefore not considered, and the user is not satisfied with the results of the search engine. Hence every token of the query is passed to WordNet to obtain more senses.
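The expansion step can be sketched with a small in-memory dictionary standing in for WordNet. The `SYNSETS` table and the function name `expand_query` are hypothetical; a real system would look the senses up in WordNet itself.

```python
# Hypothetical stand-in for WordNet lookups, using the examples from the text.
SYNSETS = {
    "vomit": ["barf", "puke"],
    "computer": ["pc"],
}


def expand_query(tokens, synsets=SYNSETS):
    """Return the query tokens followed by their synonyms, without duplicates."""
    expanded = []
    for token in tokens:
        for term in [token] + synsets.get(token, []):
            if term not in expanded:
                expanded.append(term)
    return expanded


expanded = expand_query(["computer", "repair"])
```

Tokens with no entry in the table, such as "repair" here, are passed through unchanged.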
ii. Query Vector Coordinates
The query vector coordinates are generated by checking the keyword text file and counting the occurrences of each query term. The senses are also counted, and the count is incremented accordingly. WordNet is a lexical database for the English language. It groups English words into sets of synonyms called Synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. Its purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. WordNet distinguishes between nouns, verbs, adjectives, and adverbs because they follow different grammatical rules. Every Synset contains a group of synonymous words or collocations (a collocation is a sequence of words that go together to form a specific meaning, such as "car pool"); different senses of a word are in different Synsets.
 
A query Q is represented as an n-dimensional vector q in the same vector space as the document vectors. There are several ways to search for relevant documents. Generally, we can compute a matrix to represent the similarity of the query and document vectors.
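The sketch below builds a query vector over the same vocabulary as the document vectors and compares it with each document. Cosine similarity is used here as one common choice; the paper does not name the similarity measure at this point, so that choice, the function names, and the toy vectors are all assumptions.

```python
import math


def query_vector(terms, query_terms):
    """Count how often each vocabulary term occurs among the expanded query terms."""
    return [query_terms.count(t) for t in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary and document vectors (one count per vocabulary term).
terms = ["bean", "coffee", "island", "java"]
doc_vectors = {"d1": [0, 1, 0, 2], "d2": [0, 0, 1, 1], "d3": [1, 1, 0, 0]}

q = query_vector(terms, ["java", "coffee"])
scores = {doc_id: cosine(q, vec) for doc_id, vec in doc_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here the document that mentions both query terms ranks first, while the two documents that share only one term with the query receive equal, lower scores.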
3. Perform SVD and LSA
i. Term Frequency – Inverse Document Frequency
After the Term Document Matrix is constructed, a weight is applied to every token found in the count matrix (countMatrix). The TFIDF (Term Frequency – Inverse Document Frequency) weight is calculated using the formula

$$\mathrm{TFIDF}_{i,j} = \frac{N_{i,j}}{N_{*,j}} \cdot \log\left(\frac{D}{D_i}\right)$$
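The weighting formula can be applied to a count matrix as in the sketch below. The paper does not define the symbols at this point, so the usual reading is assumed: N_{i,j} is the count of term i in document j, N_{*,j} is the total number of term occurrences in document j, D is the number of documents, and D_i is the number of documents containing term i. The natural logarithm and the function name `tfidf` are likewise assumptions.

```python
import math


def tfidf(matrix):
    """Weight a term-document count matrix, where matrix[i][j] is N_{i,j}."""
    n_docs = len(matrix[0])                                              # D
    col_totals = [sum(row[j] for row in matrix) for j in range(n_docs)]  # N_{*,j}
    weighted = []
    for row in matrix:
        d_i = sum(1 for count in row if count > 0)                       # D_i
        idf = math.log(n_docs / d_i) if d_i else 0.0
        weighted.append([
            (row[j] / col_totals[j]) * idf if col_totals[j] else 0.0
            for j in range(n_docs)
        ])
    return weighted


# Hypothetical count matrix: rows are terms, columns are documents.
counts = [[0, 0, 1], [1, 0, 1], [0, 1, 0], [2, 1, 0]]
weights = tfidf(counts)
```

A term that occurs in every document gets a weight of zero under this formula, because log(D / D_i) is zero when D_i equals D.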