Precision: What fraction of the returned results is relevant to the information need?
Recall: What fraction of the relevant documents in the collection was returned by the system?
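The two definitions above can be checked with a small worked example (the document identifiers below are purely illustrative):

```python
# Worked example of the precision and recall definitions above.
relevant = {"d1", "d2", "d3", "d4"}   # relevant documents in the collection
returned = {"d2", "d3", "d7"}         # documents returned by the system

hits = returned & relevant            # relevant documents actually returned
precision = len(hits) / len(returned) # 2/3: fraction of returned that is relevant
recall = len(hits) / len(relevant)    # 2/4: fraction of relevant that was returned
print(precision, recall)
```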
The system is implemented using a corpus of 250 documents. The query is given as input, and the processing steps are explained below. The final output is the relevant document that exactly matches the query.
Semantic Pattern Construction
Semantic Query Processing
Semantic Query Refinement & Expansion
Semantic Pattern Matching
Develop Semantic Pattern from the content
Developing a semantic pattern from the content requires the following steps. The given content is pre-processed using the Porter stemming algorithm to find the root words and by removing the stop words. The stop words are supplied manually; they carry no meaning in the content or the query.
A content repository of 250 text documents is taken as the corpus. These documents are processed into tokens. A list of selected stop words is taken; these stop words are discarded by the search engine. All the text documents in the corpus are passed through this stop-word list, and any document word that matches a stop word is eliminated. This step reduces the number of tokens. Each remaining word is considered a keyword and is stored in a text file. Typically, the stop words are pronouns, articles, and prepositions.
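The stop-word filtering step described above can be sketched as follows (the stop-word list here is a small illustrative sample, not the paper's manually supplied list):

```python
# Sketch of the stop-word removal step: tokenize each document and
# discard any token that appears in the manually supplied stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "on", "and", "to", "as", "it"}

def remove_stop_words(text):
    """Tokenize a document and keep only the non-stop-word tokens (keywords)."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

keywords = remove_stop_words("The query is given as input to the system")
print(keywords)  # ['query', 'given', 'input', 'system']
```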
Porter Stemming Algorithm
After removing the stop words, the keywords are passed to a stemming algorithm. The stemming algorithm used in this work is the Porter stemming algorithm.
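The full Porter algorithm applies five phases of suffix-stripping rules; a heavily simplified sketch of the idea (illustrative only, not the actual algorithm) is:

```python
# Heavily simplified suffix stripper in the spirit of Porter's algorithm.
# The real algorithm applies five ordered rule phases with measure conditions;
# here we just strip the longest matching suffix if a short stem remains.
SUFFIXES = ["ational", "ization", "iveness", "fulness", "ing", "ed", "es", "s"]

def simple_stem(word):
    """Strip the first (longest-first) matching suffix, keeping a stem of >= 3 chars."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(simple_stem("processing"))  # 'process'
print(simple_stem("documents"))   # 'document'
```

In practice one would use an existing implementation of the full algorithm (e.g. NLTK's PorterStemmer) rather than a hand-rolled stripper.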
This component identifies semantic elements such as subject, object, and predicate in the content and analyzes their semantic relations.
Term Document Matrix
Generate a term-document matrix to record the occurrences of each keyword in each document. The term-document matrix is a large grid representing every document and content word in a collection. The TDM (Term Document Matrix) is generated by arranging the list of all content words along the vertical axis and a similar list of all documents along the horizontal axis. These need not be in any particular order, as long as it is kept track of which row and column correspond to which keyword and document.
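The construction described above can be sketched as follows, with keywords as rows and documents as columns (the sample documents are illustrative):

```python
# Minimal sketch of building a term-document matrix: one row per content
# word, one column per document, each cell holding a raw occurrence count.
def build_tdm(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    vocab = sorted({t for d in docs for t in d.split()})
    matrix = [[d.split().count(term) for d in docs] for term in vocab]
    return vocab, matrix

docs = ["semantic query processing", "query expansion", "semantic pattern matching"]
vocab, tdm = build_tdm(docs)
print(tdm[vocab.index("query")])  # occurrences of "query" per document: [1, 1, 0]
```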
Query Refinement and Expansion using WordNet
The query entered by the user is passed through the stop-word list to remove the stop words. Stemming is then applied to retrieve only the root form. The result is passed to WordNet to obtain more senses. For example, the word vomit has senses such as barf and puke. In keyword-based search, only the word vomit itself would be matched, not its senses; different words expressing the same meaning would be missed, and the user would not be satisfied with the results of the search engine. Hence each token of the query is passed to WordNet to obtain more senses.
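The expansion step can be sketched with a tiny hand-built synonym table standing in for the WordNet lookup (a real system would query WordNet, e.g. through NLTK's interface; the `SYNSETS` table below is an assumption for illustration):

```python
# Illustrative stand-in for the WordNet lookup: a tiny synonym table
# mimicking Synsets. Each query token is replaced by its full synonym set.
SYNSETS = {
    "vomit": {"vomit", "barf", "puke"},   # the paper's own example
    "car": {"car", "auto", "automobile"},
}

def expand_query(tokens):
    """Expand each query token with its synonyms; unknown tokens pass through."""
    expanded = set()
    for t in tokens:
        expanded |= SYNSETS.get(t, {t})
    return expanded

print(sorted(expand_query(["vomit"])))  # ['barf', 'puke', 'vomit']
```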
Query Vector Coordinates
The query vector coordinates are generated by checking the keyword text file and counting the occurrences of each keyword; the senses are also counted, and the count is incremented accordingly. WordNet is a lexical database for the English language. It groups English words into sets of synonyms called Synsets, provides short general definitions, and records the various semantic relations between these synonym sets. Its purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. WordNet distinguishes between nouns, verbs, adjectives, and adverbs because they follow different grammatical rules. Every Synset contains a group of synonymous words or collocations (a collocation is a sequence of words that go together to form a specific meaning, such as "car pool"); different senses of a word are placed in different Synsets.
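A minimal sketch of building the query vector, assuming the expanded query tokens are mapped back to their base keyword so that senses increment the same coordinate (the keyword list and sense map below are illustrative):

```python
# Sketch of the query vector: for each keyword in the keyword file, count
# its occurrences in the expanded query, with each sense (e.g. "barf")
# incrementing the count of its base keyword ("vomit").
keywords = ["fever", "vomit", "cough"]        # from the keyword text file
expanded_query = ["vomit", "barf", "puke"]    # query after WordNet expansion
senses = {"barf": "vomit", "puke": "vomit"}   # map each sense to its keyword

normalized = [senses.get(t, t) for t in expanded_query]
query_vector = [normalized.count(k) for k in keywords]
print(query_vector)  # [0, 3, 0]
```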
The query is represented as a vector in the same vector space as the document vectors. There are several ways to search for relevant documents; generally, we can compute a matrix representing the similarity of the query and document vectors.
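One common similarity measure between a query vector and document vectors in the shared space is the cosine of the angle between them; a minimal sketch with illustrative vectors:

```python
import math

# Rank documents by cosine similarity between the query vector and each
# document vector in the shared vector space.
def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1, 0, 1]
doc_vectors = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
scores = [cosine(query, d) for d in doc_vectors]
print(scores.index(max(scores)))  # document 0 is the best match
```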
Perform SVD and LSA
Term Frequency – Inverse Document Frequency
After constructing the term-document matrix, apply a weight to every token found in the count matrix. The TF-IDF (Term Frequency – Inverse Document Frequency) weight is calculated using the formula

TFIDF(i, j) = ( N(i, j) / N(*, j) ) * log( D / D(i) )

where N(i, j) is the number of occurrences of term i in document j, N(*, j) is the total number of terms in document j, D is the total number of documents, and D(i) is the number of documents containing term i.
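A direct implementation of this formula, assuming `counts[i][j]` holds the raw occurrence count N(i, j) from the term-document matrix:

```python
import math

# TF-IDF weighting of a term-document count matrix:
#   weight(i, j) = (N_ij / N_*j) * log(D / D_i)
def tfidf(counts):
    """counts[i][j] = occurrences of term i in document j; returns the weight matrix."""
    num_docs = len(counts[0])
    doc_totals = [sum(row[j] for row in counts) for j in range(num_docs)]  # N_*j
    docs_with_term = [sum(1 for x in row if x) for row in counts]          # D_i
    weights = []
    for i, row in enumerate(counts):
        idf = math.log(num_docs / docs_with_term[i]) if docs_with_term[i] else 0.0
        weights.append([(row[j] / doc_totals[j]) * idf if doc_totals[j] else 0.0
                        for j in range(num_docs)])
    return weights

counts = [[2, 0, 1], [1, 1, 1]]  # two terms, three documents
weights = tfidf(counts)
```

Note that a term appearing in every document gets an IDF of log(D/D) = 0, so its weight vanishes everywhere, which is the intended down-weighting of uninformative terms.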
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 1, January 2011, ISSN 1947-5500, http://sites.google.com/site/ijcsis/