
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

HYDERABAD CAMPUS
FIRST SEMESTER 2020 – 2021
INFORMATION RETRIEVAL (CS F469) – TEST-2 SOLUTION

Date: 20.11.2020 Weightage: 12% [24 Marks] Duration: 30mins. Type: Open Book

Q1. Given the following SVD for an input matrix of size m × n (where m represents the documents and n
represents the terms), answer questions A to C. [3 M]

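A. How many singular values need to be retained if 85% of the variance has to be preserved in the data?
The first singular value alone gives 17.92 / (17.92 + 15.17 + 3.56) = 17.92 / 36.65 ≈ 49%, which is not enough.
The first two give (17.92 + 15.17) / (17.92 + 15.17 + 3.56) = 33.09 / 36.65 ≈ 90%. Hence it is enough to
retain the first two singular values to preserve 85% of the variance.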
B. What will be the sizes of the matrices U, Σ and VT if 85% of the variance is preserved?
After retaining the first two singular values, the sizes are as follows:
U will be of size 5 X 2
Σ will be of size 2 X 2
VT will be of size 2 X 5
C. Is it possible to apply SVD to a matrix of any size?
Yes, SVD can be applied to a matrix of any size.
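
The arithmetic in A and the sizes in B can be checked with a minimal sketch. It assumes the three singular values quoted in the answer are the only ones, and that the input matrix is 5 × 5 (inferred from the sizes given in B); the function names are illustrative. It follows the answer's convention of summing the singular values themselves; using squared singular values, as some texts do, also gives two retained values for these numbers.

```python
def singular_values_to_keep(singular_values, target):
    """Return (k, share) for the smallest k whose leading singular values
    reach `target` as a share of their total (convention used in answer A)."""
    total = sum(singular_values)
    running = 0.0
    for k, sigma in enumerate(singular_values, start=1):
        running += sigma
        if running / total >= target:
            return k, running / total
    return len(singular_values), 1.0


def truncated_shapes(m, n, k):
    """Shapes of U, Sigma and V^T after keeping k singular values."""
    return (m, k), (k, k), (k, n)


sigmas = [17.92, 15.17, 3.56]      # values quoted in the answer to A
k, share = singular_values_to_keep(sigmas, 0.85)
print(k, round(share, 3))          # 2 0.903
print(truncated_shapes(5, 5, k))   # ((5, 2), (2, 2), (2, 5)); m = n = 5 is assumed
```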

Q2. Given the following shingle-document matrix, answer questions A to C.
       doc_1  doc_2  doc_3
S1       1      1      1
S2       1      0      1
S3       1      0      0
S4       1      0      0
S5       1      0      1
S6       1      0      0
S7       1      0      1
S8       0      1      0
S9       0      1      0
S10      0      1      0
S11      0      1      0
S12      0      0      1
S13      0      0      1
S14      0      0      1
S15      0      0      1
S16      0      0      1

A. Compute the Jaccard similarity between doc_1 and doc_3 assuming that the shingles are asymmetric
binary random variables. [1+3+4 =8M]
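doc_1 and doc_3 share 4 shingles (S1, S2, S5, S7) out of the 12 shingles (S1 to S7 and S12 to S16) that
occur in at least one of them, so the Jaccard similarity is 4/12 ≈ 0.33.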
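
A minimal sketch of the computation, with the shingle sets read off the table (rows where the column holds a 1). The set names and the `jaccard` helper are illustrative, not part of the question.

```python
# Shingle numbers present in each document, read off the table.
doc_1 = {1, 2, 3, 4, 5, 6, 7}
doc_3 = {1, 2, 5, 7, 12, 13, 14, 15, 16}


def jaccard(a, b):
    """Asymmetric binary Jaccard: shared 1s over rows where either set has a 1.
    Rows where both documents hold 0 are ignored."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


shared = sorted(doc_1 & doc_3)   # shingles present in both documents
either = sorted(doc_1 | doc_3)   # shingles present in at least one
print(shared, len(either), jaccard(doc_1, doc_3))
```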

B. What will be the signature generated if the following permutation is taken?
s16,s15,s14,s13,s12,s11,s10,s9,s8,s7,s6,s5,s4,s3,s2,s1
7 11 16 or S7 S11 S16 (for doc_1, doc_2 and doc_3 respectively)
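
The signature can be reproduced with a short sketch: walk the shingles in the permuted order and record, for each document, the first shingle it contains. As in the answer, the sketch records the shingle label rather than its position in the permutation; the names used are illustrative.

```python
# Columns of the shingle-document matrix as sets of shingle numbers.
docs = {
    "doc_1": {1, 2, 3, 4, 5, 6, 7},
    "doc_2": {1, 8, 9, 10, 11},
    "doc_3": {1, 2, 5, 7, 12, 13, 14, 15, 16},
}

permutation = list(range(16, 0, -1))   # S16, S15, ..., S1


def minhash_by_permutation(shingles, order):
    """Return the first shingle in `order` that the document contains."""
    for s in order:
        if s in shingles:
            return s
    return None   # the document contains no shingle at all


signature = [minhash_by_permutation(docs[d], permutation)
             for d in ("doc_1", "doc_2", "doc_3")]
print(signature)   # [7, 11, 16]
```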

C. Given the following 4 hash functions, what are the entries of the first row of the signature matrix?
(12x+8)mod17
(14x+3)mod17
(11x+5)mod17
(16x+7)mod17
Any of the following would be accepted, depending on how the rows are indexed:
If the indexing is from 16 to 1, the signature would be 0 2 0
If the indexing is from 0 to 15, the signature would be 0 2 1
If the indexing is from 1 to 16, the signature would be 0 2 0
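
A minimal sketch of the first signature row for the two conventions in which S1 receives the smallest index (1 to 16 and 0 to 15). The first row corresponds to the first hash function, (12x+8) mod 17; the helper name and the `index_of` parameter are illustrative.

```python
# Columns of the shingle-document matrix as sets of shingle numbers.
docs = {
    "doc_1": {1, 2, 3, 4, 5, 6, 7},
    "doc_2": {1, 8, 9, 10, 11},
    "doc_3": {1, 2, 5, 7, 12, 13, 14, 15, 16},
}


def minhash_row(a, b, p, docs, index_of):
    """For each document, the minimum of (a*x + b) mod p over the row
    indices x of the shingles it contains."""
    return [min((a * index_of(s) + b) % p for s in docs[d])
            for d in ("doc_1", "doc_2", "doc_3")]


one_based = minhash_row(12, 8, 17, docs, lambda s: s)        # S1 -> 1, ..., S16 -> 16
zero_based = minhash_row(12, 8, 17, docs, lambda s: s - 1)   # S1 -> 0, ..., S16 -> 15
print(one_based)    # [0, 2, 0]
print(zero_based)   # [0, 2, 1]
```

The remaining three rows follow by passing the other (a, b) pairs: (14, 3), (11, 5) and (16, 7).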

Q3. Search Engine-A has returned 10 documents for a user query. The total number of documents
relevant to this query is 5, and the following is the list of Relevant (R) and Non-Relevant (NR) documents
in ranked order. [1+2=3M]

R NR NR R R NR NR R NR R

A. What will be the precision and recall at rank 4?

Precision = 2/4 = 0.5


Recall = 2/5 = 0.4
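
Both values can be checked with a short sketch; the list encoding of the ranking and the function names are illustrative.

```python
engine_a = ["R", "NR", "NR", "R", "R", "NR", "NR", "R", "NR", "R"]
TOTAL_RELEVANT = 5   # relevant documents for the query, as stated in Q3


def precision_at(ranking, k):
    """Fraction of the top-k results that are relevant."""
    return ranking[:k].count("R") / k


def recall_at(ranking, k, total_relevant):
    """Fraction of all relevant documents found in the top-k results."""
    return ranking[:k].count("R") / total_relevant


print(precision_at(engine_a, 4))                # 0.5
print(recall_at(engine_a, 4, TOTAL_RELEVANT))   # 0.4
```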

B. If the same query is run on another search engine, Engine-B, and it returns the following results:
R NR R NR R R R NR NR NR
would you prefer to query on search Engine-A or search Engine-B? Justify your answer in one line.
Search Engine-B is preferable to A since it ranks the relevant documents higher: as a user I see all 5
relevant documents by rank 7, whereas Engine-A shows the fifth only at rank 10.
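
The comparison rests on how early each engine surfaces the relevant documents. A minimal sketch that finds the rank at which each ranking has shown all of them; the function name is illustrative.

```python
engine_a = ["R", "NR", "NR", "R", "R", "NR", "NR", "R", "NR", "R"]
engine_b = ["R", "NR", "R", "NR", "R", "R", "R", "NR", "NR", "NR"]


def rank_of_full_recall(ranking, total_relevant):
    """1-based rank at which the last relevant document has been seen,
    or None if the ranking never reaches full recall."""
    seen = 0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            seen += 1
            if seen == total_relevant:
                return rank
    return None


rank_a = rank_of_full_recall(engine_a, 5)
rank_b = rank_of_full_recall(engine_b, 5)
print(rank_a, rank_b)   # the lower rank is the better ranking
```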

C. You are hired as an expert to develop a recommender system for an ELearning platform. It has a very small
user base as it has recently been launched. What type of recommender system is suitable? Justify your
answer in one line. [2 M]

In this case a content-based recommender system would be suitable, since the ELearning platform was
recently launched and does not yet have enough user data to learn from.

D. What is the advantage of combining collaborative filtering with the baseline approach? Justify your
answer in one line. [2 M]
It helps us combine the global and local effects of user preferences.
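
One common way to write this combination, sketched under assumptions: the global effect is a baseline made of the overall mean rating plus a user deviation and an item deviation, and the local effect is a similarity-weighted average of how far the user's ratings on similar items deviate from their own baselines. All numbers and names below are hypothetical and chosen only for illustration.

```python
def baseline(mu, user_bias, item_bias):
    """Global effect: overall mean plus user and item deviations."""
    return mu + user_bias + item_bias


def predict(mu, user_bias, item_bias, neighbours):
    """Baseline plus the local collaborative-filtering correction.

    `neighbours` is a list of (similarity, rating, neighbour_item_bias)
    tuples for similar items this user has already rated.
    """
    b_xi = baseline(mu, user_bias, item_bias)
    weight = sum(s for s, _, _ in neighbours)
    if weight == 0:
        return b_xi   # no usable neighbours: fall back to the global estimate
    local = sum(s * (r - baseline(mu, user_bias, b_j))
                for s, r, b_j in neighbours)
    return b_xi + local / weight


# Hypothetical values: mean 3.7, user rates 0.2 below average, item 0.5 above.
global_only = predict(3.7, -0.2, 0.5, [])
combined = predict(3.7, -0.2, 0.5, [(0.8, 4.0, 0.0), (0.2, 3.0, 0.0)])
print(round(global_only, 2), round(combined, 2))   # 4.0 4.3
```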

E. Once the utility matrix A is decomposed into U, Σ, VT using SVD, under what conditions is the
matrix A invertible (i.e. we can get back the original A using U, Σ, VT) and how? [2 M]

A is invertible if we don't reduce the dimensions of Σ (i.e. we do not drop the concepts with less variance).
In that case the original A is recovered by multiplying the three factors back together: A = U Σ VT.
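
In symbols, a short sketch of why truncation loses information. The rank r, the singular values σ_1 ≥ ... ≥ σ_r > 0, and the column vectors u_i, v_i of U and V are notation introduced here for illustration:

```latex
A = U \Sigma V^{T} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{T},
\qquad
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T} \quad (k < r),
\qquad
\lVert A - A_k \rVert_F^{2} = \sum_{i=k+1}^{r} \sigma_i^{2} > 0 .
```

Keeping all r singular values reproduces A exactly; dropping any of them leaves a nonzero reconstruction error.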

F. A Latent Factor Model is learnt using 5 user and item attributes. If 10,000 ratings from the original
matrix are used as training examples, how many parameters have to be learnt by the Stochastic Gradient
Descent model? [3 M]
Since each user rating is factorized into 5 user and 5 item attributes/factors, if 10,000 ratings are
used as training data the SGD model will have to learn 10 × 10,000 = 1,00,000 parameters.
