the document represents a dimension in the space. "Similarly, a query is represented as a vector Q = (t1, t2, ..., tn) where term ti (1<=i<=n) is a non-negative value denoting the number of occurrences of ti (or, merely a 1 to signify the occurrence of the term) in the query". Once both the documents and the query have their respective vectors calculated, it is possible to calculate the distance between the objects in the space and the query, so that objects whose semantic content is similar to the query are retrieved. Vector space models that do not calculate the distance between the objects within the space treat each term independently. Using various similarity measures, it is possible to compare queries to terms and documents in order to emphasize or de-emphasize properties of the document collection. A good example of this is, "the dot product (or, inner product) similarity measure finds the Euclidean distance between the query and a term or document in the space".
Consider the following two documents:
Document A: "A man and a woman."
Document B: "A baby."
Step-1: Each document is broken down into a word frequency table.
Document A: "A man and a woman."
A   Man   And   Woman
2   1     1     1
Document B: "A baby."
A   Baby
1   1
The tables are called vectors and can be stored as arrays.
Step-2: A vocabulary is built from all the words in all documents in the system. The vocabulary contains all words used: a, man, and, woman, baby.
Step-3: The vocabulary needs to be sorted: a, and, man, woman, baby.
Step-4: Each document is represented as a vector against the vocabulary.
Vector for Document A: "A man and a woman."
A   And   Man   Woman   Baby
2   1     1     1       0
Vector: (2,1,1,1,0)
Vector for Document B: "A baby."
A   And   Man   Woman   Baby
1   0     0     0       1
Vector: (1,0,0,0,1)
Step-5: Queries can be represented as vectors in the same way as documents. For example, the query "Woman" = (0,0,0,1,0).
(b) Similarity measures/coefficients
Using a similarity measure, a set of documents can be compared to a query and the most similar documents are returned. Similarity in VSM is determined using associative coefficients based on the inner product of the document vector and the query vector, where word overlap indicates similarity. There are many different ways to measure how similar two vectors are, such as the inner product, cosine measure, Dice coefficient, and Jaccard coefficient. The most popular similarity measure is the cosine coefficient, which measures the angle between a document vector and the query vector.
(c) The cosine measure
The cosine measure calculates the angle between the vectors in a high-dimensional virtual space. For two vectors d and d' the cosine similarity between d and d' is given by:
similarity(d, d') = (d · d') / (|d| × |d'|)
Here d · d' is the inner product of d and d', calculated by multiplying corresponding frequencies together and summing the results.
Step-6: Calculate the similarity measure of the query with every document in the collection.
For Document A, d = (2,1,1,1,0) and d' = (0,0,0,1,0):
d · d' = 2×0 + 1×0 + 1×0 + 1×1 + 0×0 = 1
|d| = sqrt(2² + 1² + 1² + 1² + 0²) = sqrt(7) = 2.646
|d'| = sqrt(0² + 0² + 0² + 1² + 0²) = sqrt(1) = 1
Similarity = 1 / (1 × 2.646) = 0.378
For Document B, d = (1,0,0,0,1) and d' = (0,0,0,1,0):
d · d' = 1×0 + 0×0 + 0×0 + 0×1 + 1×0 = 0
|d| = sqrt(1² + 0² + 0² + 0² + 1²) = sqrt(2) = 1.414
|d'| = sqrt(0² + 0² + 0² + 1² + 0²) = sqrt(1) = 1
Similarity = 0 / (1 × 1.414) = 0
(d) Ranking documents
A user enters a query.
The query is compared to all documents using a similarity measure.
The user is shown the documents in decreasing order of similarity to the query.
Step-7: Rank the documents in descending order of similarity and display them to the user:
Document A   0.378
Document B   0
3.1 Variation in VSM
(e) Stop words
Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing. Stop word lists contain frequent words to be excluded. The top 20 stop words according to their average frequency per 1000 words are: the, of, and, to, a, in, that, is, was, he, for, it, with, as, his, on, be, at, by, I [4].
3.1.1 Term weighting
Not all words are equally useful. A word is most likely to be highly relevant to document A if it is infrequent in other documents and frequent in document A. The cosine measure needs to be modified to reflect this; considering the frequency of every word in a document can do this. It is given by tf = log(1 + n(d, t)/n(d)), where n(d, t) is the number of occurrences of the term t in document d and n(d) is the number of terms in document d.
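The worked example above can be sketched in Python. This is a minimal sketch, not the paper's implementation: the vocabulary ordering and tokenization follow the example, and the function names are illustrative.

```python
import math

def tf_vector(text, vocabulary):
    """Steps 1-4: count how often each vocabulary term occurs in the text."""
    words = text.lower().replace(".", "").split()
    return [words.count(term) for term in vocabulary]

def cosine_similarity(d, q):
    """Cosine measure: (d . q) / (|d| * |q|); defined as 0 for a zero vector."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

vocabulary = ["a", "and", "man", "woman", "baby"]   # ordered as in Step-3
doc_a = tf_vector("A man and a woman.", vocabulary)  # (2,1,1,1,0)
doc_b = tf_vector("A baby.", vocabulary)             # (1,0,0,0,1)
query = tf_vector("Woman", vocabulary)               # (0,0,0,1,0)

# Score every document against the query, then rank in descending order.
scores = {"Document A": cosine_similarity(doc_a, query),
          "Document B": cosine_similarity(doc_b, query)}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))   # Document A 0.378, then Document B 0.0
```

Running this reproduces the hand calculation: Document A scores 1/sqrt(7) = 0.378 and Document B scores 0, so Document A is ranked first.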
3.1.2 Normalized term frequency (tf)
A normalized measure of the importance of a word to a document is its frequency divided by the maximum frequency of any term in the document. This is known as the tf factor.
Document A: raw frequency vector: (2,1,1,1,0); tf vector: (1, 0.5, 0.5, 0.5, 0).
This stops large documents from scoring higher.
3.1.3 Inverse document frequency (idf)
A calculation designed to make rare words more important than common words. The idf of word i is given by idf(i) = log(N / n(i)), where N is the number of documents and n(i) is the number of documents that contain word i. The idf is high for rare words and low for common words.
3.1.4 tf-idf
The tf-idf weighting scheme multiplies each word in each document by its tf factor and its idf factor. Different schemes are usually used for query vectors, and different variants of tf-idf are also used. The tf-idf weight increases with the number of occurrences within a document and with the rarity of the term across the whole corpus.
(f) Stemming
Stemming is the process of removing suffixes from words to get the common origin. In statistical analysis, it greatly helps to be able to identify words with a common meaning and form as identical when comparing texts. For example, we would like to count the words "stopped" and "stopping" as the same, both derived from "stop". Stemming identifies these common forms.
(g) Synonyms and multiple meanings of words
There are many ways to describe something; for example, "car" and "automobile" may describe the same thing. Words also often have multiple meanings.
(h) Concept-based VSM
It considers the semantics of the document instead of considering only the terms contained in the document.
(i) Proximity
If the terms occur close to each other in the document, the document is ranked higher than if they occur far apart. The words "Information" and "Retrieval" occurring together define a document on "Information Retrieval" better than a document in which the two words "Information" and "Retrieval" are scattered.
(j) File attribute
For temporal data, file attributes, such as the last modified date, play an important role in deciding document relevance.
(k) Hyperlink
The hyperlinks coming into a page and going out of it (particularly in-links) can give useful information about the page.
(l) Position of word
A word in a header (for example, an HTML tag) can be considered more content-bearing than a word in the body of the document. Similarly, a word occurring at the beginning of a document can be more indicative of the document's content than a word coming later in the document; or a word found near the word 'Abstract' may define the content of the document well.
(m) User profile
A user profile can be used to improve the process using an adaptive approach. The response of the user can also be considered to improve the process. This is possible if and only if we can measure the trustworthiness (the dependent variable) of the user response based on some independent variables such as education, age, gender, income, number of times the user has visited the page, etc.
(n) User-defined weight to terms in query
There can be some way by which the user can assign weights to the terms of his/her query.
3.2 Major Problems with VSM
There is no real theoretical basis for the assumption of a term space; it is more for visualization than having any real basis.
Most similarity measures work about the same regardless of the model.
Terms are not really orthogonal dimensions.
Terms are not independent of all other terms.
CONCLUSION
From the above findings it has been observed that the usefulness of Information Retrieval systems is going to increase in the near future from what it is at this point, because with the advancement of hardware technology and the growth of the Internet we have a huge amount of data with the potential to yield as-yet-unused information, and we now have the hardware technology to handle the problem in an efficient manner.
In this work we have given an insight into the working of the Vector Space Model techniques used for efficient retrieval. It is a bare fact that each system has its own strengths and weaknesses. What we have sorted out in our work is that the Vector Space Model is easy to understand, cheap to implement considering that the system should be cost-effective (i.e., should respect space/time constraints), and very popular. Although the system has all these properties, it faces some major drawbacks: it yields no theoretical findings, the weights associated with the vectors are very arbitrary, and the system is an independent system, thus requiring separate attention. Though it is a promising technique, the current level of success of the Vector Space Model techniques used for Information Retrieval is not able to satisfy user needs, and they require extensive attention.
REFERENCES
1. A. B. Singh, G. M. Mollick, Vaibhav Kant Singh, "A Production System Approach Acquisition on E-Governance", Proceedings of the Conference on Data Mining and E-Governance, GGU Bilaspur, Feb 2007.
2. V. K. Singh, Vinay Kumar Singh, "Rural Development Using Concept of Data Mining", Tribal Situation in Middle India: Development and Vision, Bilaspur, Oct 2007.
3. Vinay Kumar Singh, Vaibhav Kant Singh, "The Huge Potential of Information Technology", Proceedings of Global Leadership: Strategies and Challenges for Indian Business, Department of Management Studies, GGU, Bilaspur, Feb 2007.
4. Abraham Silberschatz, Henry F. Korth, and S. Sudarshan, "Database System Concepts", 4th edition, McGraw-Hill.
Note: This Paper/Article is scrutinised and reviewed by Scientific Committee, BITCON-2015, BIT, Durg, CG, India
Int. J. Adv. Engg. Res. Studies/IV/II/Jan.-March,2015/141-143
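The weighting schemes described in sections 3.1.1 to 3.1.4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are illustrative, and the two-document corpus counts used for the tf-idf example are hypothetical, not taken from the paper.

```python
import math

def tf_log(n_dt, n_d):
    """3.1.1: tf = log(1 + n(d,t)/n(d)); n(d,t) = occurrences of t in d, n(d) = terms in d."""
    return math.log(1 + n_dt / n_d)

def tf_normalized(raw_vector):
    """3.1.2: frequency divided by the maximum frequency of any term in the document."""
    max_f = max(raw_vector)
    return [f / max_f if max_f else 0.0 for f in raw_vector]

def idf(n_docs, n_containing):
    """3.1.3: idf(i) = log(N / n(i)); high for rare words, low for common words."""
    return math.log(n_docs / n_containing)

# Document A's raw frequency vector over the vocabulary (a, and, man, woman, baby).
doc_a = [2, 1, 1, 1, 0]
print(tf_normalized(doc_a))            # [1.0, 0.5, 0.5, 0.5, 0.0]

# 3.1.4: tf-idf multiplies each term's tf factor by its idf factor.
# Hypothetical counts: a 2-document corpus where "a" appears in both documents
# and every other vocabulary term appears in exactly one.
n_docs = 2
df = [2, 1, 1, 1, 1]                   # documents containing each vocabulary term
tfidf_a = [tf * idf(n_docs, n) for tf, n in zip(tf_normalized(doc_a), df)]
print([round(w, 3) for w in tfidf_a])  # "a" gets weight 0: it occurs in every document
```

Note how the corpus-wide idf factor zeroes out "a", the most frequent but least discriminating term, which is exactly the behaviour sections 3.1.3 and 3.1.4 describe.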