You are on page 1of 8

AALIM MUHAMMED SALEGH

COLLEGE OF ENGINEERING

DEPARTMENT OF INFORMATION TECHNOLOGY

R2017 – SEMESTER - VIII

PROFESSIONAL ELECTIVE - V
CS8080 – INFORMATION RETRIEVAL
TECHNIQUES

UNIT II: MODELING AND RETRIEVAL


EVALUATION
SYLLABUS:

UNIT II: MODELING AND RETRIEVAL EVALUATION 9


Basic IR Models - Boolean Model - TF-IDF (Term Frequency/Inverse Document
Frequency) Weighting - Vector Model – Probabilistic Model – Latent Semantic
Indexing Model – Neural Network Model – Retrieval Evaluation – Retrieval Metrics
– Precision and Recall – Reference Collection – User-based Evaluation – Relevance
Feedback and Query Expansion – Explicit Relevance Feedback.

PART –A ( 2 Marks Questions)

1) What do you mean information retrieval models? [BTL 2] [CO1]


Definition. A model of information retrieval (IR) selects and ranks the relevant documents with
respect to a user's query.
A retrieval model can be a description of either the computational process or the human process
of retrieval: The process of choosing documents for retrieval; the process by which information
needs are first articulated and then refined.

2) List groups of Information retrieval models. [BTL 6] [CO1]


a) Classical IR Models
b) Non- Classical IR Models
c) Alternate IR Models

3) What are the three classic models in information retrieval system? [BTL 1] [CO1]
1. Boolean model
2. Vector Space model
3. Probabilistic model

4) What is the basis for Boolean model? [BTL 1] [CO1]


 Simple model based on set theory and Boolean algebra
 Documents are sets of terms
 Queries are specified as Boolean expressions on terms

5) How can we represent the queries in boolean model? [BTL 6] [CO1]


Queries specified as boolean expressions
Precise semantics
Neat formalism
q = ka
6) Analyze the Boolean model. [BTL 5] [CO1]
Index term weight variables all are binary

 wij {0,1}
 Query q = ka b c)
 sim(qi,dj) =1 , i.e. doc’s are relevant

0, otherwise i.e. doc’s are not relevant


q = ka b c) , can be written as disjunctive normal form
,vec(qdnf)= (1,1,1) v (1,1,0) v (1,0,0)

7) What are the advantages of Boolean model? [BTL 2] [CO1]


 Clean Formalism
 Easy to implement
 Intuitive concept
 Still it is dominant model for document database systems

8) What are the disadvantages of Boolean model? [BTL 2] [CO1]


 Exact matching may retrieve too few or too many documents
 Difficult to rank output, some documents are more important than others
 Hard to translate a query into a Boolean expression
 All terms are equally weighted
 More like data retrieval than information retrieval
 No notion for partial matching
 It is not simple to translate an information need into a Boolean expression Exact matching
may lead to retrieval of too many documents.
 The retrieved documents are not ranked. The model does not use term weights.

9) What is cosine similarity? [BTL 1] [CO1]


This metric is frequently used when trying to determine similarity between two
documents. Since there are more words that are in common between two documents, it is
useless to use the other methods of calculating similarities.

10) What is language model based IR? [BTL 1] [CO1]


A language model is a probabilistic mechanism for generating text. Language models
estimate the probability distribution of various natural language phenomena.

11) Define unigram language. [BTL 1] [CO1]


A unigram (1-gram) language model makes the strong independence assumptionthat
words are generated independently from a multinomial distribution

12) Describe the characteristics of relevance feedback. [BTL 3] [CO1]


It shields the user from the details of the query reformulation process.
It breaks down the whole searching task into a sequence of small steps which areeasier to
grasp.
Provide a controlled process designed to emphasize some terms and de-emphasize others.

13) What are the assumptions of vector space model? [BTL 2] [CO1]
Assumption of vector space model:
The degree of matching can be used to rank-order documents;
This rank-ordering corresponds to how well a document satisfying a usersinformation needs.

14) Define term frequency. [BTL 1] [CO1]


Term frequency: Frequency of occurrence of query keyword in document.

15) Explain Luhn’s ideas [BTL 1] [CO1]


Luhn’s basic idea to use various properties of texts, including statistical ones, was critical
in opening handling of input by computers for IR. Automatic input joined the already automated
output.

16) Define stemming. [BTL 1] [CO1]


Conflation algorithms are used in information retrieval systems for matching the
morphological variants of terms for efficient indexing and faster retrieval operations. The
Conflation process can be done either manually or automatically. The automatic conflation
operation is also called stemming.

17) What is Recall? [BTL 1] [CO1]


Recall is the ratio of the number of relevant documents retrieved to the total number of
relevant documents retrieved.
18) What is precision? [BTL 1] [CO1]
Precision is the ratio of the number of relevant documents retrieved to the total number
of documents retrieved.

19) Explain Latent semantic Indexing. [BTL 2] [CO1]


Latent Semantic Indexing is a technique that projects queries and documents into a
space with “latent” Semantic dimensions. It is statistical method for automatic indexingand
retrieval that attempts to solve the major problems of the current technology. It is intended to
uncover latent semantic structure in the data that is hidden. It creates asemantic space where in
terms and documents that are associated are placed near one another.

PART – B (13 MARK QUESTIONS)

1) Explain in detail about vector-space retrieval models with an example?


[BTL 2] [CO1]
2) Explain about Boolean model for IR. [BTL 2] [CO1]
3) Explain about Probabilistic IR Method in detail. [BTL 2] [CO1]
4) Explain about Latent Semantic Indexing method in detail. [BTL 2] [CO1]
5) Give brief notes about user Relevance feedback method and how it is used in query
expansion. [BTL 5] [CO1]
6) Describe the document pre-processing steps in detail.
7) Illustrate the vector space retrieval model with example.
8) Describe about basic concepts of cosine similarity.
9) Develop on example to implement term weighting (min docs=5).
10) Discuss the Boolean retrieval in detail with diagram.
11) Discuss in detail about term frequency and inverse document frequency.
12) Explain latent semantic indexing and latent semantic space with an illustration.
13) Examine how to form a binary term-document incidence matrix. 2.give an example
for an above.
14) Describe document pre-processing and its stages in detail.
15) Discuss the structure of inverted indices and the basic Boolean Retrieval model.
16) Discuss the searching process in inverted file.
17) Compare language model-based information retrieval and its probabilistic
representation.
18) Explain in detail about binary independence model for probability ranking
principles.
PART – C ( 14 MARK QUESTIONS)
1) i)Describe the formal characterization of IR Models (7) [BTL 02] [CO1]

ii) Write the advantages and disadvantages for classic models which are used in IR.
(7) [BTL 5] [CO1]
2) Sort and rank the documents in descending order according to the similarity values:
Suppose we query an IR system for the query "gold silver truck"
The database collection consists of three documents (D = 3) with the following content,
D1: "Shipment of gold damaged in a fire“
D2: "Delivery of silver arrived in a silver truck“
D3: "Shipment of gold arrived in a truck" (14) [BTL 6] [CO1]
Answer:
 Finally we sort and rank the documents in descending order according to the
similarityvalues
Rank R 1: Doc 2 = 0.8246
Rank a 2: Doc 3 = 0.3271
n
k 3: Doc 1 = 0.0801
3) Estimate sparse vectors and its efficiency with diagram. (7) [BTL 02] [CO1]

4) Illustrate the following: (5+5+4) [BTL 02] [CO1]


a) Probabilistic relevance feedback.
b) Pseudo relevance feedback.
c) Indirect relevance feedback.
*********************

You might also like