IR Solutions Combined
Solution:
(a) Compute the tokens for each document
d1 = “Big | cats | are | nice | and | funny”
d2 = “Small | dogs | are | better | than | big | dogs”
d3 = “Small | cats | are | afraid | of | small | dogs”
d4 = “Big | cats | are | not | afraid | of | small | dogs”
d5 = “Funny | cats | are | not | afraid | of | small | dogs”
(b) Normalize the tokens with respect to plurals and upper/lower case
d1 = “big | cat | is | nice | and | funny”
d2 = “small | dog | is | better | than | big | dog”
d3 = “small | cat | is | afraid | of | small | dog”
d4 = “big | cat | is | not | afraid | of | small | dog”
d5 = “funny | cat | is | not | afraid | of | small | dog”
(c) Compute the dictionary relative to the documents collection
Dictionary = {big,cat,is,nice,and,funny,small,dog,better,than,afraid,of,not}
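These three steps can be sketched in a few lines of Python. The plural and verb rules below are hard-coded for this toy collection and are not a general stemmer:

```python
# Toy tokenization/normalization pipeline for the five-document collection.
docs = {
    "d1": "Big cats are nice and funny",
    "d2": "Small dogs are better than big dogs",
    "d3": "Small cats are afraid of small dogs",
    "d4": "Big cats are not afraid of small dogs",
    "d5": "Funny cats are not afraid of small dogs",
}

def normalize(token):
    """Lower-case and undo the (toy) plural/agreement variants."""
    token = token.lower()
    if token == "are":            # normalize the verb together with the plurals
        return "is"
    if token in {"cats", "dogs"}:
        return token[:-1]         # crude plural stripping: cats -> cat
    return token

normalized = {d: [normalize(t) for t in text.split()] for d, text in docs.items()}
dictionary = sorted({t for tokens in normalized.values() for t in tokens})
print(dictionary)                 # 13 terms, matching the dictionary above
```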
2. Starting from the documents collection of Exercise 1, build the documents-terms incidence matrix as
required by the Boolean model.
Solution:

       big  cat  is  nice  and  funny  small  dog  better  than  afraid  of  not
  d1    1    1   1    1     1     1      0     0     0       0      0     0    0
  d2    1    0   1    0     0     0      1     1     1       1      0     0    0
  d3    0    1   1    0     0     0      1     1     0       0      1     1    0
  d4    1    1   1    0     0     0      1     1     0       0      1     1    1
  d5    0    1   1    0     0     1      1     1     0       0      1     1    1
Solution:
(a) Answer the query q1 = funny AND dog
Rfunny = {d1 , d5 }
Rdog = {d2 , d3 , d4 , d5 }
q1 → Rfunny ∩ Rdog = {d5 }
(b) Answer the query q2 = nice OR dog
Rnice = {d1 }
Rdog = {d2 , d3 , d4 , d5 }
q2 → Rnice ∪ Rdog = {d1 , d2 , d3 , d4 , d5 }
(c) Answer the query q3 = big AND dog AND NOT funny
Rbig = {d1 , d2 , d4 }   Rdog = {d2 , d3 , d4 , d5 }   Rfunny = {d1 , d5 }
q3 → (Rbig ∩ Rdog ) ∩ Rfunny^C = {d2 , d4 } ∩ {d2 , d3 , d4 } = {d2 , d4 }
(d) Translate query q3 into a Disjunctive Normal Form considering a dictionary = {big, cat, funny, small, dog}.
The truth assignments satisfying q3 are:

  big  cat  funny  small  dog
   1    0     0      0     1
   1    0     0      1     1
   1    1     0      0     1
   1    1     0      1     1

i.e. q3 ≡ (big ∧ ¬cat ∧ ¬funny ∧ ¬small ∧ dog) ∨ (big ∧ ¬cat ∧ ¬funny ∧ small ∧ dog) ∨ (big ∧ cat ∧ ¬funny ∧ ¬small ∧ dog) ∨ (big ∧ cat ∧ ¬funny ∧ small ∧ dog)
4. Starting from the documents collection of Exercise 1, build the documents-terms weights matrix using
as term-frequency:
(a) the number of occurrences of the term in each document
(b) the normalized number of occurrences
(c) the logarithmic number of occurrences
Solution:
Compute the inverse document frequency for each term
ti idfi
big log2 (5/3)
cat log2 (5/4)
is log2 (5/5)
nice log2 (5/1)
and log2 (5/1)
funny log2 (5/2)
small log2 (5/4)
dog log2 (5/4)
better log2 (5/1)
than log2 (5/1)
afraid log2 (5/3)
of log2 (5/3)
not log2 (5/2)
Since the term “is” appears in all the documents we can safely ignore it from here on.
Compute the number of occurrences of each term in each document

       big  cat  nice  and  funny  small  dog  better  than  afraid  of  not
  d1    1    1    1     1     1      0     0     0       0      0     0    0
  d2    1    0    0     0     0      1     2     1       1      0     0    0
  d3    0    1    0     0     0      2     1     0       0      1     1    0
  d4    1    1    0     0     0      1     1     0       0      1     1    1
  d5    0    1    0     0     1      1     1     0       0      1     1    1
Remember that wi,j = tfi,j · idfi
Compute the documents-terms weights matrix using as term-frequency:
(a) the number of occurrences of the term in each document [tfi,j = freqi,j ]

       big          cat          nice        and         funny        small
  d1   log2 (5/3)   log2 (5/4)   log2 (5/1)  log2 (5/1)  log2 (5/2)   0
  d2   log2 (5/3)   0            0           0           0            log2 (5/4)
  d3   0            log2 (5/4)   0           0           0            2 log2 (5/4)
  d4   log2 (5/3)   log2 (5/4)   0           0           0            log2 (5/4)
  d5   0            log2 (5/4)   0           0           0            log2 (5/4)

       dog           better      than        afraid      of           not
  d1   0             0           0           0           0            0
  d2   2 log2 (5/4)  log2 (5/1)  log2 (5/1)  0           0            0
  d3   log2 (5/4)    0           0           log2 (5/3)  log2 (5/3)   0
  d4   log2 (5/4)    0           0           log2 (5/3)  log2 (5/3)   log2 (5/2)
  d5   log2 (5/4)    0           0           log2 (5/3)  log2 (5/3)   log2 (5/2)
(b) the normalized number of occurrences [tfi,j = freqi,j / maxi freqi,j ]
For each document compute the maximum number of occurrences among all the terms
dj maxi f reqi,j
d1 1
d2 2
d3 2
d4 1
d5 1
Compute the documents-terms weights matrix

       big             cat             nice        and         funny       small
  d1   log2 (5/3)      log2 (5/4)      log2 (5/1)  log2 (5/1)  log2 (5/2)  0
  d2   0.5 log2 (5/3)  0               0           0           0           0.5 log2 (5/4)
  d3   0               0.5 log2 (5/4)  0           0           0           log2 (5/4)
  d4   log2 (5/3)      log2 (5/4)      0           0           0           log2 (5/4)
  d5   0               log2 (5/4)      0           0           0           log2 (5/4)

       dog             better          than            afraid          of              not
  d1   0               0               0               0               0               0
  d2   log2 (5/4)      0.5 log2 (5/1)  0.5 log2 (5/1)  0               0               0
  d3   0.5 log2 (5/4)  0               0               0.5 log2 (5/3)  0.5 log2 (5/3)  0
  d4   log2 (5/4)      0               0               log2 (5/3)      log2 (5/3)      log2 (5/2)
  d5   log2 (5/4)      0               0               log2 (5/3)      log2 (5/3)      log2 (5/2)
(c) the logarithmic number of occurrences [tfi,j = 1 + log2 f reqi,j ]
Please note that
1 + log2 1 = 1 + 0 = 1
1 + log2 2 = 1 + 1 = 2
       big          cat          nice        and         funny        small
  d1   log2 (5/3)   log2 (5/4)   log2 (5/1)  log2 (5/1)  log2 (5/2)   0
  d2   log2 (5/3)   0            0           0           0            log2 (5/4)
  d3   0            log2 (5/4)   0           0           0            2 log2 (5/4)
  d4   log2 (5/3)   log2 (5/4)   0           0           0            log2 (5/4)
  d5   0            log2 (5/4)   0           0           0            log2 (5/4)

       dog           better      than        afraid      of           not
  d1   0             0           0           0           0            0
  d2   2 log2 (5/4)  log2 (5/1)  log2 (5/1)  0           0            0
  d3   log2 (5/4)    0           0           log2 (5/3)  log2 (5/3)   0
  d4   log2 (5/4)    0           0           log2 (5/3)  log2 (5/3)   log2 (5/2)
  d5   log2 (5/4)    0           0           log2 (5/3)  log2 (5/3)   log2 (5/2)
5. Starting from the documents collection of Exercise 1, rank the documents with respect to query q = {big, cat, afraid} using the normalized term-frequency model (Exercise 4b).
(a) Use the Euclidean distance
(b) Use Cosine similarity
(c) Use Jaccard similarity
Solution:
Recall the normalized term-document weights matrix
       big   cat   nice  and   funny  small  dog   better  than  afraid  of    not
  d1   0.74  0.32  2.32  2.32  1.32   0      0     0       0     0       0     0
  d2   0.37  0     0     0     0      0.16   0.32  1.16    1.16  0       0     0
  d3   0     0.16  0     0     0      0.32   0.16  0       0     0.37    0.37  0
  d4   0.74  0.32  0     0     0      0.32   0.32  0       0     0.74    0.74  1.32
  d5   0     0.32  0     0     1.32   0.32   0.32  0       0     0.74    0.74  1.32
Compute the query vector (each query term occurs once, so its weight equals its idf):

       big   cat   nice  and   funny  small  dog   better  than  afraid  of    not
  q    0.74  0.32  0     0     0      0      0     0       0     0.74    0     0

(b) Use Cosine similarity

 SC(q, dj )
 d1 0.16
 d2 0.14
 d3 0.45
 d4 0.57
 d5 0.27
Ranking: d4 > d3 > d5 > d1 > d2
(c) Use Jaccard similarity
SC(q, dj )
d1 0.05
d2 0.07
d3 0.25
d4 0.32
d5 0.12
Ranking: d4 > d3 > d5 > d2 > d1
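The cosine scores in (b) can be reproduced numerically from the weights matrix and query vector above:

```python
import math

terms = ["big", "cat", "nice", "and", "funny", "small",
         "dog", "better", "than", "afraid", "of", "not"]
W = {
    "d1": [0.74, 0.32, 2.32, 2.32, 1.32, 0, 0, 0, 0, 0, 0, 0],
    "d2": [0.37, 0, 0, 0, 0, 0.16, 0.32, 1.16, 1.16, 0, 0, 0],
    "d3": [0, 0.16, 0, 0, 0, 0.32, 0.16, 0, 0, 0.37, 0.37, 0],
    "d4": [0.74, 0.32, 0, 0, 0, 0.32, 0.32, 0, 0, 0.74, 0.74, 1.32],
    "d5": [0, 0.32, 0, 0, 1.32, 0.32, 0.32, 0, 0, 0.74, 0.74, 1.32],
}
q = [0.74, 0.32, 0, 0, 0, 0, 0, 0, 0, 0.74, 0, 0]   # big, cat, afraid

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

scores = {d: cosine(q, v) for d, v in W.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # d4 ranked first
```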
6. Starting from the documents collection of Exercise 1, rank the documents with respect to query q = {afraid, cat, funny} using the probabilistic model under the binary independence assumption.
Initialize the probability pi that term ti appears in a document relevant to the query with even odds.
Initialize the probability ui that term ti appears in a document not relevant to the query assuming that all documents are non-relevant.
In the following iterations assume that the top-2 documents are relevant. In case of ties order the documents (di ) by increasing index i.
Solution:
1. Consider only the terms appearing in the query
d1 = “cat—funny”
d2 = “”
d3 = “cat—afraid”
d4 = “cat—afraid”
d5 = “funny—cat—afraid”
Code the terms as t1 = “afraid”, t2 = “cat” and t3 = “funny”. N = 5 documents in the
collection.
Initialize the probability of term ti appearing in documents non-relevant to the query:
ui = ni /N for each term ti , where ni is the number of documents containing term ti .
u1 = 3/5 = 0.6
u2 = 4/5 = 0.8
u3 = 2/5 = 0.4
Initialize the probability of term ti appearing in documents relevant to the query.
p1 = 0.5
p2 = 0.5
p3 = 0.5
2. 1st iteration. Compute the Similarity Coefficient between each document and the query:

SC(dj , q) = Σ over ti ∈ dj ∩ q of [ log2 (pi /(1 − pi )) + log2 ((1 − ui )/ui ) ]

SC(d1 , q) = log2 (p2 /(1 − p2 )) + log2 ((1 − u2 )/u2 ) + log2 (p3 /(1 − p3 )) + log2 ((1 − u3 )/u3 ) = −1.42
SC(d2 , q) = −∞ (no query terms)
SC(d3 , q) = log2 (p1 /(1 − p1 )) + log2 ((1 − u1 )/u1 ) + log2 (p2 /(1 − p2 )) + log2 ((1 − u2 )/u2 ) = −2.58
SC(d4 , q) = log2 (p1 /(1 − p1 )) + log2 ((1 − u1 )/u1 ) + log2 (p2 /(1 − p2 )) + log2 ((1 − u2 )/u2 ) = −2.58
SC(d5 , q) = log2 (p1 /(1 − p1 )) + log2 ((1 − u1 )/u1 ) + log2 (p2 /(1 − p2 )) + log2 ((1 − u2 )/u2 ) + log2 (p3 /(1 − p3 )) + log2 ((1 − u3 )/u3 ) = −2

The top-2 documents (d1 and d5 ) are now assumed relevant (|R| = 2; ri is the number of relevant documents containing ti ). Update the probabilities:

p1 = 1/2    p2 = 2/2 = 1    p3 = 2/2 = 1
u1 = (3 − 1)/3 = 0.67
u2 = (4 − 2)/3 = 0.67
u3 = (2 − 2)/3 = 0
3. 2nd iteration. Compute the Similarity Coefficient between each document and the query. Since p2 = p3 = 1 and u3 = 0, the corresponding log-odds diverge; write them as limits for ε → 0:

SC(d1 , q) = 3 · lim ε→0 log2 ((1 − ε)/ε)
SC(d2 , q) = −∞ (no query terms)
SC(d3 , q) = lim ε→0 log2 ((1 − ε)/ε)
SC(d4 , q) = lim ε→0 log2 ((1 − ε)/ε)
SC(d5 , q) = 3 · lim ε→0 log2 ((1 − ε)/ε)
Solution:
(a) Compute the number of True-Positives, True-Negatives, False-Positives, False-Negatives
TP = 3
FP = 4
FN = 2
T N = 91
(b) Compute Precision, Recall, Balanced F-measure, Accuracy
P = 3/7
R = 3/5
F = 1/2
A = 94/100
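The four measures follow directly from the counts; a quick check in Python:

```python
# Precision, recall, balanced F-measure and accuracy from the counts above.
TP, FP, FN, TN = 3, 4, 2, 91

P = TP / (TP + FP)                     # 3/7
R = TP / (TP + FN)                     # 3/5
F = 2 * P * R / (P + R)                # 1/2
A = (TP + TN) / (TP + FP + FN + TN)    # 94/100
print(P, R, F, A)
```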
8. An IR system produces the following rankings in answer to queries q1 and q2 . The documents marked with * are the ones relevant to the user (recovered from the precision values in the solution).
 R   q1  q2
 1   A*  F*
 2   L   G*
 3   G*  D
 4   F   E*
 5   D   L
 6   E*  I
 7   B*  H
 8   H*  C
 9   I   B*
 10  C   A
(a) Draw the precision-recall curve and the interpolated precision-recall curve
(b) Compute the Mean Average Precision
(c) Compute the R-precision
(d) Draw the Receiver-Operating-Characteristic
Solution:
(a) Draw the precision-recall curve and the interpolated precision-recall curve
Retrieved documents Pq1 Rq1 Pq2 Rq2
1 1/1 1/5 1/1 1/4
2 1/2 1/5 2/2 2/4
3 2/3 2/5 2/3 2/4
4 2/4 2/5 3/4 3/4
5 2/5 2/5 3/5 3/4
6 3/6 3/5 3/6 3/4
7 4/7 4/5 3/7 3/4
8 5/8 5/5 3/8 3/4
9 5/9 5/5 4/9 4/4
10 5/10 5/5 4/10 4/4
[Figure: precision-recall curve and interpolated precision-recall curve for q1 (Precision vs. Recall)]
[Figure: precision-recall curve and interpolated precision-recall curve for q2 (Precision vs. Recall)]
(b) Compute the Mean Average Precision

MAP = (0.67 + 0.80)/2 = 0.74
(c) Compute the R-precision
Rpq1 = 2/5
Rpq2 = 3/4
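Average precision looks only at the ranks of the relevant documents; a sketch that reproduces the MAP value, with the relevance lists read off the rankings above (here every relevant document is eventually retrieved, so dividing by the number of hits equals dividing by the number of relevant documents):

```python
def average_precision(relevant_at_rank):
    """relevant_at_rank: one boolean per retrieved rank, in rank order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevant_at_rank, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

q1 = [1, 0, 1, 0, 0, 1, 1, 1, 0, 0]   # A, G, E, B, H relevant
q2 = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]   # F, G, E, B relevant
MAP = (average_precision(q1) + average_precision(q2)) / 2
print(round(MAP, 2))   # 0.74
```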
(d) Draw the Receiver-Operating-Characteristic
Retrieved documents   FPrq1   TPrq1   FPrq2   TPrq2
1 0/5 1/5 0/6 1/4
2 1/5 1/5 0/6 2/4
3 1/5 2/5 1/6 2/4
4 2/5 2/5 1/6 3/4
5 3/5 2/5 2/6 3/4
6 3/5 3/5 3/6 3/4
7 3/5 4/5 4/6 3/4
8 3/5 5/5 5/6 3/4
9 4/5 5/5 5/6 4/4
10 5/5 5/5 6/6 4/4
[Figure: ROC curve for q1 (True-Positive rate vs. False-Positive rate)]
[Figure: ROC curve for q2 (True-Positive rate vs. False-Positive rate)]
9. Starting from the documents collection of Exercise 1, build an inverted index for the documents collec-
tion.
Solution:
term coll. freq. postings list (freq)
afraid 3 3(1) → 4(1) → 5(1)
and 1 1(1)
better 1 2(1)
big 3 1(1) → 2(1) → 4(1)
cat 4 1(1) → 3(1) → 4(1) → 5(1)
dog 5 2(2) → 3(1) → 4(1) → 5(1)
funny 2 1(1) → 5(1)
is 5 1(1) → 2(1) → 3(1) → 4(1) → 5(1)
nice 1 1(1)
not 2 4(1) → 5(1)
of 3 3(1) → 4(1) → 5(1)
small 5 2(1) → 3(2) → 4(1) → 5(1)
than 1 2(1)
Solution:
(a) Compute the Borda winner
A=1·1+2·1+4·1=7
B =1·2+2·1=4
C = 2 · 1 + 4 · 2 = 10
D =3·3=9
Borda’s winner is B (the candidate with the lowest total rank).
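A sketch of the Borda aggregation by rank sums. The three input rankings are not reproduced in this extract, so the ballots below are an assumed reconstruction consistent with the position sums computed above:

```python
# Borda aggregation: sum each candidate's rank positions; lowest total wins.
# NOTE: these three rankings are a hypothetical reconstruction from the sums
# A = 1+2+4, B = 2+1+1, C = 2+4+4, D = 3+3+3.
rankings = [
    ["A", "B", "D", "C"],   # r1
    ["B", "A", "D", "C"],   # r2
    ["B", "C", "D", "A"],   # r3
]

score = {c: 0 for c in "ABCD"}
for r in rankings:
    for position, candidate in enumerate(r, start=1):
        score[candidate] += position

winner = min(score, key=score.get)
print(score, winner)
```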
(b) Compute the Condorcet winner
B wins on A
A wins on C
A wins on D
B wins on C
B wins on D
D wins on C
[Diagram: pairwise tournament graph over {A, B, C, D}; B beats every other candidate]
Condorcet’s winner is B.
(c) Compute the top-2 documents using the MedRank algorithm
1. Sequential access
r1 r2 r3
A B B
B is selected as 1st document
2. Sequential access
r1 r2 r3
A B B
B A C
A is selected as 2nd document
top-2 documents = (B, A)
(d) Compute the top-2 documents using Fagin’s algorithm
1. Sequential access
r1 r2 r3
A (0.9) B (0.9) B (0.8)
2. Sequential access
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6) A (0.8) C (0.7)
B visible in all the rankings
3. Sequential access
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6) A (0.8) C (0.7)
D (0.5) D (0.7) D (0.6)
B, D visible in all the rankings. Compute the scores for all the extracted objects
{A, B, C, D} performing random access when needed.
A = 0.73 (r.a. to r3 )
B = 0.77
C = 0.57 (r.a. to r1 , r2 )
D = 0.60
top-2 documents = (B, A)
(e) Compute the top-2 documents using Fagin’s threshold algorithm
1. Sequential access to r1
r1 r2 r3
A (0.9)
Compute the score of A by making random access to r2 , r3 .
A = 0.73
R = (A)
th = +∞
2. Sequential access to r2
r1 r2 r3
A (0.9) B (0.9)
Compute the score of B by making random access to r1 , r3 .
B = 0.77
R = (B, A)
th = +∞
3. Sequential access to r3
r1 r2 r3
A (0.9) B (0.9) B (0.8)
Score of B already computed.
R = (B, A)
th = +∞
4. Compute the threshold.
th = 0.87
5. Sequential access to r1
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6)
Score of B already computed.
R = (B, A)
th = 0.87
6. Sequential access to r2
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6) A (0.8)
Score of A already computed.
R = (B, A)
th = 0.87
7. Sequential access to r3
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6) A (0.8) C (0.7)
Compute the score of C by making random access to r1 , r2 .
C = 0.57
R = (B, A)
th = 0.87
8. Compute the threshold.
th = 0.70
top-2 documents = (B, A)
Solution:
(a) Compute the optimal aggregation using the Kendall-Tau distance
                    K(r1 , rci )  K(r2 , rci )  K(r3 , rci )  Σj K(rj , rci )
 rc1 = (A, B, C)         0             1             2               3
 rc2 = (A, C, B)         1             2             3               6
 rc3 = (B, A, C)         1             0             1               2
 rc4 = (B, C, A)         2             1             0               3
 rc5 = (C, A, B)         2             3             2               7
 rc6 = (C, B, A)         3             2             1               6
rc3 = (B, A, C) is the optimal aggregation under the Kendall-Tau distance
(b) Compute the optimal aggregation using the Spearman’s footrule distance
                    F (r1 , rci )  F (r2 , rci )  F (r3 , rci )  Σj F (rj , rci )
 rc1 = (A, B, C)         0              2              4                6
 rc2 = (A, C, B)         2              4              4               10
 rc3 = (B, A, C)         2              0              2                4
 rc4 = (B, C, A)         4              2              0                6
 rc5 = (C, A, B)         4              4              4               12
 rc6 = (C, B, A)         4              4              2               10
rc3 = (B, A, C) is the optimal aggregation under the Spearman’s footrule distance
(c) Compute the footrule aggregation using the median rank approximation
µ0 (A) = median{1, 2, 3} = 2
µ0 (B) = median{2, 1, 1} = 1
µ0 (C) = median{3, 3, 2} = 3
(B, A, C) is the footrule aggregation using the median rank approximation.
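Both distances and the brute-force search over all candidate aggregations are easy to code for three items; the input rankings below are recovered from the zero entries of the distance tables above (r1 = ABC, r2 = BAC, r3 = BCA):

```python
from itertools import combinations, permutations

r1, r2, r3 = ["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]

def kendall_tau(r, s):
    """Number of item pairs ordered differently in r and s."""
    pos_r = {x: i for i, x in enumerate(r)}
    pos_s = {x: i for i, x in enumerate(s)}
    return sum(1 for x, y in combinations(r, 2)
               if (pos_r[x] - pos_r[y]) * (pos_s[x] - pos_s[y]) < 0)

def footrule(r, s):
    """Spearman's footrule: sum of absolute position displacements."""
    pos_s = {x: i for i, x in enumerate(s)}
    return sum(abs(i - pos_s[x]) for i, x in enumerate(r))

def optimal(dist):
    return min(permutations("ABC"),
               key=lambda rc: sum(dist(r, list(rc)) for r in (r1, r2, r3)))

print(optimal(kendall_tau), optimal(footrule))   # both (B, A, C)
```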
Homework 3
Exercise 18.1; Exercise 18.2; Exercise 18.5; Exercise 18.8; Exercise 18.11
Exercise 21.6; Exercise 21.10; Exercise 21.11
Exercise 13.2; Exercise 13.9
Exercise 14.2; Exercise 14.6;
Exercise 15.2
Exercise 16.3; Exercise 16.13; Exercise 16.17; Exercise 16.20
Exercise 18.1 (0.5’)
What is the rank of the 3 × 3 matrix below?

  1 1 0
  0 1 1
  1 2 1

Solution:
By applying Gauss elimination we get:

  1 1 0      1 1 0      1 0 −1
  0 1 1  →   0 1 1  →   0 1  1
  1 2 1      0 1 1      0 0  0

Two nonzero rows remain, so the rank of the matrix is 2.
For C = ( 6 −2 ; 4 0 ) and eigenvalue λ = 2, solving the system (C − 2I)x = 0 gives x2 = 2x1 .
Hence any vector of the form ( x1 , 2x1 )T with x1 ≠ 0 is a corresponding eigenvector.
Furthermore, Σ2 is obtained from Σ by zeroing out all but the two largest singular values:

Σ2 =  1.9021   0
      0        1.8478
U2 = 0 0.7071
0.0000 0
-0.0000 0
0.0000 0
-0.0000 0
-0.7236 0
-0.2764 0
-0.4472 0
-0.4472 0
0 0.5000
0 0.5000
V2 =
0 0.3827
0 0
0 0
-0.5257 0
-0.8507 0
0 0.9239
C2 =
0.5000 0 0 0 0 1.2071
0 0 0 -0.0000 -0.0000 0
0 0 0 0.0000 0.0000 0
0 0 0 -0.0000 -0.0000 0
0 0 0 0.0000 0.0000 0
0 0 0 0.7236 1.1708 0
0 0 0 0.2764 0.4472 0
0 0 0 0.4472 0.7236 0
0 0 0 0.4472 0.7236 0
0.3536 0 0 0 0 0.8536
0.3536 0 0 0 0 0.8536
3. The (i, j) entry in the matrix C T C represents the number of terms occurring in both document i and document j.
4. The (i, j) entry in the matrix C2T C2 represents the similarity between documents i and j in the low-dimensional space.
Solution 2:
The MATLAB code is as follows:

P = [0.02 0.02 0.88 0.02 0.02 0.02 0.02;
     0.02 0.45 0.45 0.02 0.02 0.02 0.02;
     0.31 0.02 0.31 0.31 0.02 0.02 0.02;
     0.02 0.02 0.02 0.45 0.45 0.02 0.02;
     0.02 0.02 0.02 0.02 0.02 0.02 0.88;
     0.02 0.02 0.02 0.02 0.02 0.45 0.45;
     0.02 0.02 0.02 0.31 0.31 0.02 0.31];
[W, D] = eig(P');        % left eigenvectors of P
x = W(:, 1)';            % eigenvector for the eigenvalue 1
x = x / sum(x);          % normalize to a probability distribution
(ii) For the multinomial model, documents 1 and 2 are identical and they are
different from document 3, because the term London occurs twice in
documents 1 and 2, but occurs once in document 3.
Solution:
(Multinomial model)
(i) With add-one smoothing, |V | = 7 and 5 tokens in each class, the estimates for the two distinct query terms (call them t1 and t2 ; the original term names are not shown here) are:

P̂ (c) = 1/2
P̂ (t1 | c) = (2 + 1)/(5 + 7) = 1/4
P̂ (t2 | c) = (0 + 1)/(5 + 7) = 1/12
P̂ (c̄) = 1/2
P̂ (t1 | c̄) = (1 + 1)/(5 + 7) = 1/6
P̂ (t2 | c̄) = (2 + 1)/(5 + 7) = 1/4

(ii) The test document d5 contains t1 twice and t2 once:

P̂ (c | d5 ) ∝ 1/2 · (1/4) · (1/4) · (1/12) ≈ 0.0026
P̂ (c̄ | d5 ) ∝ 1/2 · (1/6) · (1/6) · (1/4) ≈ 0.0035

So document 5 is not in Class China.
(Bernoulli model)
(iii) With add-one smoothing and 2 documents per class, the estimates are:

P̂ (c) = 1/2
For class c: one term has P̂ (t|c) = (2 + 1)/(2 + 2) = 3/4, three terms have P̂ (t|c) = (1 + 1)/(2 + 2) = 1/2, and three terms have P̂ (t|c) = (0 + 1)/(2 + 2) = 1/4.
P̂ (c̄) = 1/2
For class c̄: three terms have P̂ (t|c̄) = (1 + 1)/(2 + 2) = 1/2, one term has P̂ (t|c̄) = (2 + 1)/(2 + 2) = 3/4, and three terms have P̂ (t|c̄) = (0 + 1)/(2 + 2) = 1/4.

(iv) The Bernoulli model also multiplies in (1 − P̂ (t|class)) for every vocabulary term absent from the test document:

P̂ (c | d5 ) ∝ 1/2 · 3/4 · 1/4 · (1 − 1/2) · (1 − 1/2) · (1 − 1/2) · (1 − 1/4) · (1 − 1/4) ≈ 0.0066
P̂ (c̄ | d5 ) ∝ 1/2 · 1/2 · 3/4 · (1 − 1/4) · (1 − 1/4) · (1 − 1/4) · (1 − 1/2) · (1 − 1/2) ≈ 0.0198

So document 5 is not in Class China.
[Figure: two classes in the plane, drawn as circles, the left class A much larger than the right class B; a training point C of class A lies close to the boundary]

Take the above picture as an example. There are 2 classes in the plane, with the left one being much bigger than the right one. A large part of the left circle will then be misclassified, like the point C: C is a document belonging to class A in the training set, but it is closer to B than to A, so it will be labeled as class B.
Solution:
(i) ⟨a, x⟩ = 4, ⟨b, x⟩ = 16, ⟨c, x⟩ = 28.
So c is most similar to x.
(ii) cos(a, x) = ⟨a, x⟩/(|a||x|) = 0.8944, cos(b, x) = ⟨b, x⟩/(|b||x|) = 1, cos(c, x) = ⟨c, x⟩/(|c||x|) = 0.9899.
So b is most similar to x.
(iii) d(x, a) = 1.5811, d(x, b) = 2.8284, d(x, c) = 7.2111.
So a is most similar to x (smallest distance).
The derivation (writing ω for the cluster and |ω| for its size):

Σ over x∈ω of |x − w|² = Σ x∈ω |x|² − 2 w · Σ x∈ω x + |ω| |w|²

Setting the derivative with respect to w to zero:

−2 Σ x∈ω x + 2 |ω| w = 0   ⟹   w = (1/|ω|) Σ x∈ω x

Hence the sum of squared distances is minimized when w is the centroid of ω.
Assignment 2 - Solutions
Solution
We can compute the probability of a document d being in a class c with the following formula:
P (c|d) ∝ P̂ (c) · Π over 1≤k≤nd of P̂ (tk |c)    (1)
P̂ (c) = P̂ (c̄) = Nc /N = 2/4 = 1/2

P̂ (t|c) = (C(t, c) + 1) / ( Σ over t′∈V of C(t′ , c) + |V | )
The vocabulary has 7 terms: Kyoto, Tokyo, Taiwan, Japan, Taipei, Macao, Beijing. There
are 5 tokens in the concatenation of all c documents and 5 tokens in the concatenation of all c̄ documents. Thus, the denominators have the form (5 + 7). The conditional probabilities for
both classes are then as follows:
Now we can put it all together and compute the class to which the test document will be assigned
using the formula (1):
P̂ (c|d) ∝ 1/2 · (2/12)² · 3/12 = 1/2 · 12/(12 · 12 · 12) = 1/2 · 12/1728 = 1/288
P̂ (c̄|d) ∝ 1/2 · (3/12)² · (1/12) = 1/2 · 9/(12 · 12 · 12) = 1/2 · 9/1728 = 1/384
Since 1/288 > 1/384, the test document is assigned to class c.
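The smoothed estimates and the final products can be checked in a few lines. The term labels t1/t2 below are hypothetical stand-ins (the original term names are not reproduced in this extract); t1 occurs twice in the test document, t2 once:

```python
# Add-one (Laplace) smoothed multinomial Naive Bayes, reproducing the
# arithmetic above. Counts are per-class raw occurrences of the test terms.
V = 7                 # vocabulary size
tokens_per_class = 5  # tokens in the concatenation of each class's documents

count_c    = {"t1": 1, "t2": 2}   # ASSUMED counts consistent with (2/12)^2 * 3/12
count_cbar = {"t1": 2, "t2": 0}   # ASSUMED counts consistent with (3/12)^2 * 1/12
test_doc = ["t1", "t1", "t2"]

p_c = 0.5
for t in test_doc:
    p_c *= (count_c[t] + 1) / (tokens_per_class + V)      # denominator 5 + 7 = 12

p_cbar = 0.5
for t in test_doc:
    p_cbar *= (count_cbar[t] + 1) / (tokens_per_class + V)

print(p_c, p_cbar)    # 1/288 vs 1/384: class c wins
```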
Solution
We would perform the lookup on the key: ng$s*.
      ε  c  a  t  c  a  t
  ε   0  1  2  3  4  5  6
  c   1  0  1  2  3  4  5
  a   2  1  0  1  2  3  4
  t   3  2  1  0  1  2  3

The edit distance (bottom-right cell) is 3.
After you have calculated the distance between the two strings: Trace the editing operations for one
possible editing path as demonstrated in class:
cost operation input output
1 insert * c
1 insert * a
1 insert * t
0 (copy) c c
0 (copy) a a
0 (copy) t t
Solution
Levenshtein matrix:

      ε  r  o  m  n  e  y
  ε   0  1  2  3  4  5  6
  o   1  1  1  2  3  4  5
  b   2  2  2  2  3  4  5
  a   3  3  3  3  3  4  5
  m   4  4  4  3  4  4  5
  a   5  5  5  4  4  5  5

The edit distance between “obama” and “romney” is 5.
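The matrices above are filled by the standard dynamic program; a compact two-row implementation:

```python
def levenshtein(s, t):
    """Classic edit distance: insert, delete, substitute, all cost 1."""
    prev = list(range(len(t) + 1))          # row for the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute / copy
        prev = curr
    return prev[-1]

print(levenshtein("obama", "romney"))   # 5
```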
• d1 : The European Union Act 2011 prevents additional powers being passed to Brussels without
a referendum
Solution
P (q|d) = Π over 1≤k≤|q| of [λP (tk |Md ) + (1 − λ)P (tk |Mc )]
P (q|d1 ) = [0.4 · 1/15 + 0.6 · 1/25] · [0.4 · 1/15 + 0.6 · 1/25] ≈ 0.051² ≈ 0.0026
P (q|d2 ) = [0.4 · 0/10 + 0.6 · 1/25] · [0.4 · 0/10 + 0.6 · 1/25] = 0.024² ≈ 0.000576
⇒ Ranking: d1 > d2
Consider the following frequencies for the class coffee for four terms in the first 100,000 documents of
Reuters-RCV1:
a) Which two terms will be selected in frequency-based feature selection and why?
b) Compute the MI values and order the terms according to MI. Which two terms will be selected
in MI-based feature selection?
Solution
a) The terms brazil and producers will be selected in frequency-based feature selection because
they are the most frequent terms in the class coffee.
b) brazil :
council :
producers:
roasted :
(1) brazil
(2) producers
(3) roasted
(4) council
The terms brazil and producers will be selected in MI-based feature selection.
Assignment 1 - Solutions
Which document(s) (if any) match each of the following queries at which positions, where each ex-
pression within quotes is a phrase query? (i) “fools rush in” (ii) “fools rush in” AND “angels fear to
tread”.
Solution
(i) doc2:1, doc4:8, doc7:3,13 (ii) doc4:8 & 12
Solution
See boolean.py in the assignment 1 ex 2 solution.zip file on the course homepage.
Solution
Solution
Processing postings lists in order of size (i.e. the shortest postings list first) is usually a good approach, but it is not optimal, e.g. in a conjunctive query with three terms:
term 1 −→ 1 2 3
term 2 −→ 2 3 4 5
term 3 −→ 10 11 20 30 50
As we can see there is no document containing all three query terms. Had we checked the first posting of the third list right at the beginning, we would have noticed that there is no intersection between the first and the third postings lists, which would have made any further search superfluous.
Assignment 3 - Solutions
Solution
(ii) What is the sequence of postings? (the first entry is the docID of the first document)
Solution
Solution
document query
word tf-raw tf-wght df idf weight n’lized tf-raw tf-wght weight product
digital 1 1 10,000 3 3 0.61 1 1 1 0.61
video 1 1 100,000 2 2 0.40 0 0 0 0
phones 3 1.48 50,000 2.3 3.4 0.69 1 1 1 0.69
Length of document vector: √(3² + 2² + 3.4²) ≈ 4.96
Normalized document term weights: 3/4.96 ≈ 0.61, 2/4.96 ≈ 0.40, 3.4/4.96 ≈ 0.69
Jaccard(Q, D) = |Q ∩ D| / |Q ∪ D| = |{digital, phones}| / |{digital, video, phones}| = 2/3
Solution
Equation 9.3:

q⃗m = α q⃗0 + β (1/|Dr |) Σ over d⃗j ∈Dr of d⃗j − γ (1/|Dnr |) Σ over d⃗j ∈Dnr of d⃗j
Assignment 4
Solution
We calculate the Kappa measure by means of the following formula:
κ = (P (A) − P (E)) / (1 − P (E))
where
Observed proportion of the times the judges agreed P (A) = (120 + 10)/200 = 130/200 = 0.65
Pooled marginals
P (positive) = (160 + 150)/(200 + 200) = 310/400 = 0.775
P (not positive) = (40 + 50)/(200 + 200) = 90/400 = 0.225
Probability that the two judges agreed by chance: P (E) = P (not positive)² + P (positive)² = 0.225² + 0.775² = 0.050625 + 0.600625 = 0.65125
Kappa statistic: κ = (P (A) − P (E))/(1 − P (E)) = (0.65 − 0.65125)/(1 − 0.65125) = −0.00125/0.34875 ≈ −0.00358
The agreement is too low to be a reliable basis for a gold standard.
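The same computation in code:

```python
# Kappa from the judge agreement counts above (200 documents, two judges).
agree = 120 + 10                      # both positive + both negative
total = 200
P_A = agree / total                   # observed agreement, 0.65
P_pos = (160 + 150) / (2 * total)     # pooled marginal P(positive), 0.775
P_neg = (40 + 50) / (2 * total)       # pooled marginal P(not positive), 0.225
P_E = P_pos ** 2 + P_neg ** 2         # chance agreement, 0.65125
kappa = (P_A - P_E) / (1 - P_E)
print(round(kappa, 5))                # ~ -0.00358
```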
Assignment 5 - Solutions
yi (w⃗ T x⃗i + b) ≥ 1.    (1)
Estimate a SVM for the data given below. Find the support vectors and the general form of the normal vector before solving the equation system (1) for the best matching w⃗. Chapter 15.1 of the IR book may help you with this exercise.
[Figure: training points in the plane (x from 0 to 6, y from 0 to 5); triangles (positive class) in the upper region, circles (negative class) in the lower region]
Solution
(1) Correct solution:

[Figure: the same point set with the maximum-margin hyperplane and its margin lines]
The weight vector is parallel to the shortest line connecting points of the two classes. That is, the line
between ~x1 = (3, 2) and ~x2 = (4.5, 3), giving a weight vector of ~x2 − ~x1 = (1.5, 1). But if we take the
two points ~x1 and ~x2 as support vectors for the classifier margin we see that the point ~x3 = (3, 4.5)
is inside the margin (see second-best solution below). Thus, the decision hyperplane constructed only
from the two points ~x1 and ~x2 would not guarantee the largest margin possible. Therefore we have to
consider the point ~x3 as an additional support vector.
Before we can arrange an equation system to find the best matching w⃗ we have to calculate the weight vector w⃗ which is perpendicular to the decision hyperplane. Since there are two support vectors from the triangles’ class (x⃗2 and x⃗3 ) on the top margin line, we first compute the vector that connects them: x⃗2 − x⃗3 = (1.5, −1.5)T . Next, we find a perpendicular vector by means of the dot product (two vectors are perpendicular if their dot product is 0): 1.5 · 1 + (−1.5) · 1 = 0. Thus w⃗ = (1, 1)T , so we already know that the solution has the form w⃗ = (1a, 1a) for some a. Using this knowledge we can arrange a system of equations which considers all three support vectors:

w⃗ · x⃗1 + b = 5a + b = −1
w⃗ · x⃗2 + b = 7.5a + b = 1
w⃗ · x⃗3 + b = 7.5a + b = 1
By solving this system of equations we get the following values for a and b:

a = 4/5
b = −5
(2) Second-best solution:

[Figure: the same point set with a decision hyperplane built from the two support vectors x⃗1 = (3, 2) and x⃗2 = (4.5, 3) only]
The weight vector is parallel to the shortest line connecting points of the two classes. That is, the
line between ~x1 = (3, 2) and ~x2 = (4.5, 3), giving a weight vector of ~x2 − ~x1 = (1.5, 1). In this (not
completely correct) solution we ignore that there is another point of the triangles’ class that is within
the functional margin.
Thus, we know that w⃗ = (1.5a, 1a) for some a. So we get the following system of equations:

w⃗ · x⃗1 + b = 6.5a + b = −1
w⃗ · x⃗2 + b = 9.75a + b = 1

By solving the system of equations we get the following values for a and b:

a = 8/13
b = −5
1. the two centroids are a local optimum; that is, one iteration of reassignment and recomputation
will not change the position of the centroids
(iii) Give two centroids that are better for the 5 points than the ones in (ii). (No need to prove global
optimality, but show they are better than in (ii).)
Solution
[Figure: the five points a, b, c, d, e plotted in the plane (axes 0 to 7), shown once for part (i) and once for part (ii)]
(i)
1. centroids A = (1,1), B = (1,2)
1. assignment: A: a; B: b, c, d
2. centroids A = (1,1) B = (11/3,14/3) = (3.67,4.67)
2. assignment: A: a,b; B: c,d
3. centroids: A: (1,1.5) B: (5,6)
3. assignment A: a,b; B: c,d
4. centroids: A: (1,1.5) B: (5,6)
=⇒ converged
(ii)
e = (6, 1.5), A = (3, 2.375), D = (6, 7)
|A, c|2 = (3 − 4)2 + (2.375 − 5)2 ≈ 7.89
|D, c|2 = (6 − 4)2 + (7 − 5)2 = 8
So c is closer to A than to D. a, b and e are also closer to A than to D. So this is a stable set of two
centroids for the five points.
RSS = sum of squared distances of a, b, c, d, e (where a, b, c, e are assigned to A and d is assigned to
D): ((3−1)2 +(2.375−1)2 )+((3−1)2 +(2.375−2)2 )+7.89+(02 +02 )+((3−6)2 +(2.375−1.5)2 ) ≈ 27.69
(iii)
A = (8/3, 1.5), D = (5, 6) is a better set of centroids than the ones in (ii). Proof:
RSS = sum of squared distances of a, b, c, d, e (where a, b, e are assigned to A and c, d are assigned
to D): ((8/3 − 1)2 + 0.52 ) + ((8/3 − 1)2 + 0.52 ) + (12 + 12 ) + (12 + 12 ) + ((8/3 − 6)2 + 02 ) ≈ 21.17 < 27.69
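The two RSS values can be verified directly from the point coordinates used above:

```python
# RSS check for parts (ii) and (iii) of the k-means exercise.
points = {"a": (1, 1), "b": (1, 2), "c": (4, 5), "d": (6, 7), "e": (6, 1.5)}

def rss(assignment, centroids):
    """Sum of squared distances of each point to its assigned centroid."""
    return sum((points[p][0] - centroids[c][0]) ** 2 +
               (points[p][1] - centroids[c][1]) ** 2
               for p, c in assignment.items())

rss_ii = rss({"a": "A", "b": "A", "c": "A", "e": "A", "d": "D"},
             {"A": (3, 2.375), "D": (6, 7)})
rss_iii = rss({"a": "A", "b": "A", "e": "A", "c": "D", "d": "D"},
              {"A": (8 / 3, 1.5), "D": (5, 6)})
print(round(rss_ii, 2), round(rss_iii, 2))   # (iii) is lower, hence better
```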
Solution
If two document are thematically similar they contain similar terms. Even if a frequent term does
not occur in a document the document will be correctly clustered due to other terms common to the
topic.
Solution
The two conditions imply each other. If the assignment does not change then the centroids remain
the same. And if the centroids do not change that means that no reassignment took place.
Assignment 6
 U       1      2      3      4      5
 1     0.44  −0.30  −0.57   0.58  −0.25
 2     0.13  −0.33   0.59   0.00  −0.73
 3     0.48  −0.51   0.37   0.00   0.61
 4     0.70   0.35  −0.15  −0.58  −0.16
 5     0.26   0.65   0.41   0.58   0.09

 V       1      2      3      4      5
 1     0.75  −0.29  −0.28   0.00   0.53
 2     0.28  −0.53   0.75   0.00  −0.29
 3     0.20  −0.19  −0.45   0.58  −0.63
 4     0.45   0.63   0.20   0.00  −0.19
 5     0.33   0.22  −0.12   0.58  −0.41
 6     0.12   0.41   0.33   0.58   0.22
1. Calculate the reduced matrix C3 . That is the term-document matrix C reduced to 3 dimensions
(see slides).
2. Compare the rankings of the query “ship ocean” for the matrices C and C3 : Rank the documents
after relevance.
Solution
a)
To calculate C3 we need the matrix Σ3 , keeping only the first three singular values (the other values are “zeroed out”):
Σ3 1 2 3 4 5
1 2.16 0.00 0.00 0.00 0.00
2 0.00 1.59 0.00 0.00 0.00
3 0.00 0.00 1.28 0.00 0.00
4 0.00 0.00 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00
Furthermore V has to be transposed since the formula for singular value decomposition is C = U ΣV T :
VT 1 2 3 4 5 6
1 0.75 0.28 0.20 0.45 0.33 0.12
2 -0.29 -0.53 -0.19 0.63 0.22 0.41
3 -0.28 0.75 -0.45 0.20 -0.12 0.33
4 0.00 0.00 0.58 0.00 0.58 0.58
5 0.53 -0.29 -0.63 -0.19 -0.41 0.22
U Σ3    1       2       3      4      5
1 0.950 -0.477 -0.730 0.000 0.000
2 0.281 -0.525 0.755 0.000 0.000
3 1.037 -0.811 0.474 0.000 0.000
4 1.512 0.557 -0.192 0.000 0.000
5 0.562 1.034 0.525 0.000 0.000
Remark:
The incorrect matrix C3 that results if U Σ3 is multiplied with V instead of V T :
C3 d1 d2 d3 d4 d5
ship 0.433 0.116 -0.295 -0.423 1.102
boat 0.215 0.053 -0.812 0.438 -0.174
ocean 0.645 0.039 -1.112 0.275 0.486
wood 1.252 -0.697 0.081 -0.111 0.761
tree 0.816 -0.811 0.382 0.305 -0.333
b)
C : d1 , d2 , d3 or d1 , d3 , d2
C3 : d1 , d2 , d3 , d5 , d4 , d6
Computation:
Solution
Shingle-document incidence and hash functions for the permutations:

 h(x) = (2x + 2) mod 7
 g(x) = (4x + 1) mod 7

       d3  d4  d5     h(i)  g(i)
 s1     0   0   1      4     5
 s2     0   0   1      6     2
 s3     1   1   1      1     6
 s4     0   0   0      3     3
 s5     0   0   1      5     0
 s6     0   0   1      0     4
 s7     1   0   1      2     1

Each slot starts at ∞ and, as the shingles are scanned, keeps the minimum h- and g-value over the shingles present in the document:

       h-slot  g-slot
 d3      1       1
 d4      1       6
 d5      0       0

J(d3 , d4 ) = (1 + 0)/2 = 1/2
J(d3 , d5 ) = (0 + 0)/2 = 0
J(d4 , d5 ) = (0 + 0)/2 = 0
Assignment 7 - Solutions
Solution
ergodic = aperiodic (no periodic behavior) and irreducible (roughly: there is a path from every page
to every other page)
PageRank is well-defined if surfing the web graph is ergodic
q1 q2
q3
For the web graph in the figure, compute PageRank, hub and authority scores for each of the three
pages. Also give the relative ordering of the 3 nodes indicating any ties.
Assume that at each step of the PageRank random walk, we teleport to a random page with probability
0.1, with a uniform distribution over which particular page we teleport to. Normalize the hub and
authority scores so that the maximum hub/authority score is 1.
Hint: Using symmetries to simplify and solving with linear equations might be easier than using iter-
ative methods.
Solution
Since the in-degree of A is 0, the steady-visit rate (or rank) of A is 0.1 · 1/3 = 1/30 (from teleport).
By symmetry, rank(B) = rank(C). Thus, rank(B)=rank(C) = 29/60.
Transition matrix P with teleport:

         q1      q2      q3
 q1     1/3     1/3     1/3
 q2    14/15    1/30    1/30
 q3    14/15    1/30    1/30
Starting from the initialization (1/3, 1/3, 1/3), one step of the iteration gives

x⃗P = (0.733333, 0.133333, 0.133333)

Let y = rank(q1 ) and x = rank(q2 ) = rank(q3 ). In the steady state:

(28/30) · 2x + (1/3) y = y
(28/15) x − (2/3) y = 0
(28/15) x − (2/3)(1 − 2x) = 0
(48/15) x = 2/3
⟹ x = (2 · 15)/(3 · 48) = 5/24 ≈ 0.2083
y = 1 − 2x = 14/24 = 7/12 ≈ 0.5833
HITS, Solution 1
matrix A:        matrix AT:
 0 0 0            0 1 1
 1 0 0            0 0 0
 1 0 0            0 0 0

matrix AAT:      matrix AT A:
 0 0 0            2 0 0
 0 1 1            0 0 0
 0 1 1            0 0 0

a⃗ = (1 1 1)T
(AT A) a⃗ = (2 0 0)T
(AT A)² a⃗ = (4 0 0)T
(AT A)³ a⃗ = (8 0 0)T

h⃗ = (1 1 1)T
(AAT ) h⃗ = (0 2 2)T
(AAT )² h⃗ = (0 4 4)T
(AAT )³ h⃗ = (0 8 8)T
HITS, Solution 2
Authorities: authority(d2 ) = authority(d3 ) = 0 since nobody is pointing to these two pages. authority(d1 ) >
0 since somebody is pointing to d1 , thus value greater zero. After normalization (there is no page with
a greater authority) this value is 1.0.
Hubs: By similar reasoning: hub(d1 ) = 0, hub(d2 ) = hub(d3 ) > 0.
There is no page with a hub score higher than d2 and d3 , thus hub(d2 ) = hub(d3 ) = 1.
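The HITS iterations above can be reproduced with plain list arithmetic:

```python
# HITS for the 3-node graph; A[i][j] = 1 iff page i links to page j.
A = [[0, 0, 0],
     [1, 0, 0],
     [1, 0, 0]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

a = h = [1.0, 1.0, 1.0]
for _ in range(20):
    a = matvec(transpose(A), h)          # authority: hub scores pointing in
    h = matvec(A, a)                     # hub: authority scores pointed to
    a = [v / max(max(a), 1e-12) for v in a]   # normalize max score to 1
    h = [v / max(max(h), 1e-12) for v in h]

print(a, h)   # authorities (1, 0, 0), hubs (0, 1, 1)
```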
Solution
Σ (qi − wi )² = Σ qi² − 2 Σ qi wi + Σ wi² = 1 − 2 Σ qi wi + 1 = 2(1 − Σ qi wi )

(Note that for a normalized vector x⃗ we have Σ xi² = 1.)

Thus: |q⃗ − v⃗| < |q⃗ − w⃗| ⇔ |q⃗ − v⃗|² < |q⃗ − w⃗|² ⇔ Σ (qi − vi )² < Σ (qi − wi )² ⇔ 2(1 − Σ qi vi ) < 2(1 − Σ qi wi ) ⇔ Σ qi vi > Σ qi wi ⇔ cos(q⃗, v⃗) > cos(q⃗, w⃗)

This proves that ordering normalized vectors according to increasing distance is the same as ordering them according to decreasing cosine similarity.
(a) Calculate precision, recall, accuracy and (balanced) f-measure of the presented classifier.
(b) Why do we usually have to face a tradeoff between precision and recall?
Solution
(a)
                 relevant   nonrelevant
 retrieved         170          30
 not retrieved      20          80

 P = 170/(170 + 30) = 0.85
 R = 170/(170 + 20) ≈ 0.89
 F = 2P R/(P + R) ≈ 0.87
 A = (170 + 80)/300 ≈ 0.83
(b) Because different users have different needs. Some users want to get documents that match their
query as exact as possible (precision). They do not want to read all the documents that are
available for a certain topic (recall). On the other hand, there are people who want to obtain all
documents related to a topic, e.g. lawyers who want to get all law documents related to drug
possession. They need high recall.
With respect to an information retrieval system we can always achieve a recall of 1 when we
retrieve the whole collection, but then precision will be very low. When we want high precision,
we can do this by only returning the documents where we are very sure that they are relevant,
in the extreme only 1 document. This will of course create a very low recall. In practice, we will
never have these extreme behaviours, but we nearly always face a decision if we want to increase
precision or recall.
(a) List the classification algorithms we have seen in chapters 13, 14 and 15 and give their key
properties.
(b) Usually, we have dealt with only 2 classes in our examples. What changes with respect to the
classification algorithms in (a) do we need to make if we want to classify more than 2 classes?
(c) Explain the difference between classification and clustering.
Solution
(a) Linear: Naive Bayes [Probabilistic; Independence assumption: One feature is independent from
other features], Rocchio [Calculates Centroids and assigns new documents the class of the nearest
centroid], SVM [Large margin, uses support vectors to calculate a decision hyperplane between
classes]
Non-linear: kNN [decision boundary consists of locally linear segments, no training needed]
(b) If we have e.g. 4 classes: Perform a first binary classification for c1 and {c2 , c3 , c4 }. In the next
step classify c2 and {c3 , c4 } etc.
(c) Classification is supervised, i.e. a classifier is trained on a labeled dataset. Clustering on the
other side is unsupervised: It is carried out on unlabeled data.
This worksheet has three parts: tutorial Questions, followed by some Examples and their Solutions.
• Before your tutorial, work through and attempt all of the Questions in the first section.
• The Examples are there for additional preparation, practice, and revision.
• Use the Solutions to check your answers, and read about possible alternatives.
You must bring your answers to the main questions along to your tutorial. You will need to be
able to show these to your tutor, and may be exchanging them with other students, so it is best to
have them printed out on paper.
If you cannot do some questions, write down what it is that you find challenging and use this to
ask your tutor in the meeting.
Tutorials will not usually cover the Examples, but if you have any questions about those then write
them down and ask your tutor, or go along to InfBASE during the week.
It’s important both for your learning and other students in the group that you come to tutorials
properly prepared. If you have not attempted the main tutorial questions, then you may be sent
away from the tutorial to do them elsewhere.
Some exercise sheets contain material marked with a star (⋆). These are optional extensions.
Data & Analysis tutorials are not formally assessed, but they are a compulsory and important part
of the course. If you do not do the exercises then you are unlikely to pass the exam.
Attendance at tutorials is obligatory: if you are ill or otherwise unable to attend one week then
email your tutor, and if possible attend another tutorial group in the same week.
Please send any corrections and suggestions to Ian.Stark@ed.ac.uk
Introduction
This tutorial is about Information Retrieval (IR). It deals with two aspects of the information retrieval
task discussed in lectures: evaluating performance of IR systems, and methods for document ranking.
Note that these exercises are running concurrently with the Inf1-DA assignment. If you have questions
or difficulties with that, please ask your tutor about them during the tutorial.
(a) Calculate the precision and recall for this system, showing the details of your calculations.
(b) Based on your results from (a), explain what the two measures mean for this scenario. How well
would you say that the hospital’s information IR system works?
(c) According to the precision-recall tradeoff, what will likely happen if an IR system is tuned to
aim for 100% recall?
(d) For the given scenario, which measure do you think is more important, precision or recall? Why?
Given your answer, what value would you give to the weighting factor α when calculating the
F-score measure for the hospital’s IR system?
? (e) Last semester, in Informatics 1: Computation and Logic, you encountered the properties of
soundness and completeness for a logic. Can you relate them to precision and recall of an
IR system?
(a) One possible measure for determining which of the 3 documents is the cosine similarity measure,
which measures the cosine of the angle between the query vector and that of each document.
Compute this measure for each of the three documents.
(b) Based on your results of (a), which document is the best match for this query? Why?
(c) Do you agree with the results of this analysis? What are the strengths and weaknesses of cosine
measure?
Examples
This section contains further exercises on information retrieval. All are based on parts of past exam
papers. Following these there is a section presenting solutions and notes on all the examples.
Example 1
(a) What is the information retrieval task ? Give an example of such a task, indicating how it
matches your description.
(b) The performance of an information retrieval system can be evaluated in terms of its precision, P ,
and recall, R. Give an English-language definition of these two terms.
(d) Two retrieval systems, X and Y, are being compared. Both are given the same query, applied
to a collection of 1500 documents. System X returns 400 documents, of which 40 are relevant to
the query. System Y returns 30 documents, of which 15 are relevant to the query. Within the
whole collection there are in fact 50 documents relevant to the query.
Tabulate the results for each system, and compute the precision and recall for both X and Y.
Show your working.
(e) Both precision and recall need to be taken into account when evaluating retrieval systems. Why
is it not sufficient to pick one and use only that?
(f ) The F -score is a measure which combines both measures using a weighting factor α, where high α
means that precision is more important. Give the formula defining the F -score for weighting α.
(h) For the example task you gave in part (a), suggest an appropriate weighting factor α. Justify
your choice.
Example 2
Suppose you wish to find economic reports regarding the impact of oil extraction in the North Sea on
the Scottish economy. A commercial document retrieval service offers the following suggested matches:
the table shows how often some key phrases appear in each report.
North Sea oil Scotland economy
Report A 12 0 3 24
Report B 10 5 20 10
Report C 0 12 9 8
Query 1 1 1 1
Actually obtaining the reports will cost real money, so you would like to select the one most likely to
be relevant. Your task now is to assess this using the cosine similarity measure.
(a) Write out the general formula for calculating the cosine of the angle between two 4-dimensional
vectors (x1 , x2 , x3 , x4 ) and (y1 , y2 , y3 , y4 ).
(b) Use this formula to rank the three documents in order of relevance to the query according to
the cosine similarity measure. What do you think of the results?
Solutions to Examples
These are not entirely “model” answers; instead, they indicate a possible solution. Remember that
not all of questions necessarily have a single “right” answer. If you have difficulties with a particular
example, or have trouble following through the solution, please raise this as a question in your tutorial.
Solution 1
(a) The information retrieval task is to find those documents relevant to a user query from among
some large collection of documents.
For example, searching for previous legal rulings relevant to a certain topic from a judicial
archive. The judicial archive is the document collection; the query is some words related to the
topic; and the previous rulings are the relevant documents to be retrieved.
Other examples are possible, of course; but you would still need to identify the document
collection, the query, and which documents are relevant.
(b) Precision records what proportion of the documents retrieved do in fact match the query; recall
is the proportion of relevant documents in the collection which are successfully retrieved.
This kind of question is often referred to as “bookwork” — however, even though the required
information can indeed be found in books, it’s still important to be able to explain it clearly in
any given context.
(c) Here are the full names and definitions for the three terms.
Note that this question asks you to both “name” and “define” the values, so it wouldn’t be
enough to say just “True Positives”: you need the definition as well.
(d) “Tabulate” means to exhibit in a table, so this question requires a table showing the results for
each system.
System X precision P = 40/400 = 0.1    System Y precision P = 15/30 = 0.5
System X recall    R = 40/50  = 0.8    System Y recall    R = 15/50 = 0.3
(e) Depending on just one out of precision and recall can lead to extreme but unhelpful solutions.
A system that returns every document indiscriminately has 100% recall; while one that returns
only a single correct document is 100% precise. As information retrieval systems, the first is no
help at all, and the second is not much better.
(f ) Here is the formula for F -score in terms of α.
Fα = 1 / ( α·(1/P) + (1 − α)·(1/R) )
(g) For α = 0.5 the F0.5 -score, or balanced F -score is the harmonic mean of precision and recall.
F0.5 = 1 / ( (1/2)·(1/P) + (1/2)·(1/R) ) = 2PR / (P + R)
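As a check, the precision, recall, and F-score computations from parts (d), (f) and (g) can be reproduced with a short script (values from the X/Y comparison above):

```python
def precision(tp, returned):
    return tp / returned

def recall(tp, relevant):
    return tp / relevant

def f_score(p, r, alpha):
    """F-score with weighting factor alpha (high alpha favours precision)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# System X: 400 returned, 40 relevant; System Y: 30 returned, 15 relevant;
# 50 relevant documents in the whole collection.
p_x, r_x = precision(40, 400), recall(40, 50)   # 0.1, 0.8
p_y, r_y = precision(15, 30), recall(15, 50)    # 0.5, 0.3
print(f_score(p_x, r_x, 0.5))  # balanced F = 2PR/(P+R) ≈ 0.178
print(f_score(p_y, r_y, 0.5))  # ≈ 0.375
```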
(h) For the retrieval of legal judgements, recall is of particular importance (you really don’t want to
miss anything), so a value of α below 0.5, say 0.2, might be appropriate.
For other examples, either recall or precision might be more important, depending on the exact
choice of example.
Solution 2
(a) The cosine formula for 4-vectors is:
cos(x, y) = (x1·y1 + x2·y2 + x3·y3 + x4·y4) / ( √(x1² + x2² + x3² + x4²) · √(y1² + y2² + y3² + y4²) )
It's also possible to give a more compact presentation using vector notation:
cos(x, y) = x·y / (|x| |y|)
although that’s only useful if you are confident in how to then calculate the dot product and
modulus of 4-dimensional vectors.
(b) For the three reports listed, the appropriate calculation is the cosine between each report and
the original query.
cos(Report A, Query) = (12 + 0 + 3 + 24) / ( √4 · √(12² + 3² + 24²) ) = 39/54 ≈ 0.72
cos(Report B, Query) = (10 + 5 + 20 + 10) / ( √4 · √(10² + 5² + 20² + 10²) ) = 45/50 = 0.90
cos(Report C, Query) = (0 + 12 + 9 + 8) / ( √4 · √(12² + 9² + 8²) ) = 29/34 ≈ 0.85
The best fit is where the cosine is largest, closest to 1. This ranks the three documents in order
of similarity to the query as:
• Report B
• Report C
• Report A
These results seem reasonable: Report B is the only document which contains all the keywords;
while Report C does mention oil it doesn’t mention the North Sea specifically; and Report A
doesn’t mention oil at all. The superiority of C over A seems clear in the cosine measure, but I
don’t think it is altogether obvious from simply inspecting the numbers.
Notice that it’s not necessary to take the inverse cosine and compute the actual angle between
the vectors. The question doesn't ask for this. However if you did, then the best match would
be the smallest angle, closest to 0.
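The calculation above is easy to verify with a few lines of code (report vectors taken from the table in Example 2):

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

# phrase counts: North Sea, oil, Scotland, economy
reports = {"A": [12, 0, 3, 24], "B": [10, 5, 20, 10], "C": [0, 12, 9, 8]}
query = [1, 1, 1, 1]
ranking = sorted(reports, key=lambda k: cosine(reports[k], query), reverse=True)
print(ranking)  # ['B', 'C', 'A']
```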
Homework 2
Page 110: Exercise 6.10; Exercise 6.12
Page 116: Exercise 6.15; Exercise 6.17
Page 121: Exercise 6.19
Page 122: Exercise 6.20; Exercise 6.23; Exercise 6.24
Page 131: Exercise 7.3; Exercise 7.5; Exercise 7.8
Page 144: Exercise 8.1
Page 145: Exercise 8.3
Page 150: Exercise 8.9
Page 154: Exercise 8.10
Page 167: Exercise 9.3
Page 177: Exercise 9.7
Page 211: Exercise 11.3
Page 228: Exercise 12.6; Exercise 12.7
Exercise 6.10
Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in
Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, best, for
each document, using the idf values from Figure 6.8.
Solution
           Doc1    Doc2    Doc3
car        44.55   6.6     39.6
auto       6.24    68.64   0
insurance  0       53.46   46.98
best       21      0       25.5
Exercise 6.12
How does the base of the logarithm in (6.7) affect the score calculation in (6.9)? How
does the base of the logarithm affect the relative scores of two documents on a given
query?
Solution
log_a x = log_a b · log_b x, where log_a b is a constant.
So changing the base affects every score by the same constant factor log_a b, and the relative scores of two documents on a given query are unchanged.
Exercise 6.15
Recall the tf-idf weights computed in Exercise 6.10. Compute the Euclidean
normalized document vectors for each of the documents, where each vector has four
components, one for each of the four terms.
Solution
doc1 = [0.8974, 0.1257, 0, 0.4230]
doc2 = [0.0756, 0.7867, 0.6127, 0]
doc3 = [0.5953, 0, 0.7062, 0.3833]
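These Euclidean-normalized vectors can be checked with a small script (tf-idf values from Exercise 6.10, component order car, auto, insurance, best):

```python
import math

def normalize(v):
    """Divide a vector by its Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

doc1 = normalize([44.55, 6.24, 0, 21])
print([round(x, 4) for x in doc1])  # -> [0.8974, 0.1257, 0.0, 0.423]
```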
Exercise 6.17
With term weights as computed in Exercise 6.15, rank the three documents by
computed score for the query car insurance, for each of the following cases of term
weighting in the query:
1. The weight of a term is 1 if present in the query, 0 otherwise.
2. Euclidean normalized idf.
Solution
1. q = [1, 0, 1, 0]
score(q, doc1)= 0.8974, score(q, doc2) = 0.6883, score(q, doc3) = 1.3015
Ranking: doc3, doc1, doc2
2. q = [0.4778, 0.6024, 0.4692, 0.4344]
score(q, doc1) = 0.6883, score(q, doc2) = 0.7975, score(q, doc3) = 0.7823
Ranking: doc2, doc3, doc1
Exercise 6.19
Compute the vector space similarity between the query “digital cameras” and the
document “digital cameras and video cameras” by filling out the empty columns in
Table 6.1. Assume N = 10,000,000, logarithmic term weighting (wf columns) for
query and document, idf weighting for the query only and cosine normalization for
the document only. Treat and as a stop word. Enter term counts in the tf columns.
What is the final similarity score?
Solution
Word      Query: tf   wf   df       idf   qi = wf·idf   Document: tf   wf    di = normalized wf   qi·di
digital          1    1    10,000   3     3                        1    1     0.52                 1.56
video            0    0    100,000  2     0                        1    1     0.52                 0
cameras          1    1    50,000   2.3   2.3                      2    1.3   0.68                 1.56
Similarity score = 1.56+1.56 = 3.12
Exercise 6.20
Show that for the query affection, the relative ordering of the scores of the three
documents in Figure 6.13 is the reverse of the ordering of the scores for the query
jealous gossip.
Solution:
For the query affection, score(q, SaS) = 0.996, score(q, PaP) = 0.993, score(q, WH) =
0.847, so the order is SaS, PaP, WH.
For the query jealous gossip, score(q, SaS) = 0.104, score(q, PaP) = 0.12, score(q,
WH) = 0.72, so the order is WH, PaP, SaS.
So the latter case is the reverse order of the former case.
Exercise 6.23
Refer to the tf and idf values for four terms and three documents in Exercise 6.10.
Compute the two top scoring documents on the query best car insurance for each of
the following weighting schemes: (i) nnn.atc; (ii) ntc.atc.
Solution
(i) nnn.atc
nnn weights for documents
Term       Doc1  Doc2  Doc3
car        27    4     24
auto       3     33    0
insurance  0     33    29
best       14    0     17
Term       Query: tf (augmented)  idf   tf-idf  atc weight   Product: Doc1  Doc2   Doc3
car        1                      1.65  1.65    0.56                  15.12  2.24   13.44
auto       0.5                    2.08  1.04    0.353                 1.06   11.65  0
insurance  1                      1.62  1.62    0.55                  0      18.15  15.95
best       1                      1.5   1.5     0.51                  7.14   0      8.67
Score(q, doc1) = 15.12 + 1.06 +0 + 7.14 = 23.32, score(q, doc2) = 2.24 + 11.65 +
18.15 + 0 = 32.04, score(q, doc3) = 13.44 + 0 + 15.95 + 8.67 = 38.06
Ranking: doc3, doc2, doc1
(ii) ntc.atc
ntc weights for doc1
Term       tf   idf   tf-idf  normalized weight
car        27   1.65  44.55   0.897
auto       3    2.08  6.24    0.125
insurance  0    1.62  0       0
best       14   1.5   21      0.423
ntc weights for doc2
Term       tf   idf   tf-idf  normalized weight
car        4    1.65  6.6     0.075
auto       33   2.08  68.64   0.786
insurance  33   1.62  53.46   0.613
best       0    1.5   0       0
ntc weights for doc3
Term       tf   idf   tf-idf  normalized weight
car        24   1.65  39.6    0.595
auto       0    2.08  0       0
insurance  29   1.62  46.98   0.706
best       17   1.5   25.5    0.383
Term       Query: tf (augmented)  idf   tf-idf  atc weight   Product: Doc1  Doc2   Doc3
car        1                      1.65  1.65    0.56                  0.502  0.042  0.33
auto       0.5                    2.08  1.04    0.353                 0.044  0.277  0
insurance  1                      1.62  1.62    0.55                  0      0.337  0.38
best       1                      1.5   1.5     0.51                  0.216  0      0.19
Score(q, doc1) = 0.762, score(q, doc2) = 0.657, score(q, doc3) = 0.916
Ranking: doc3, doc1, doc2
Exercise 6.24
Suppose that the word coyote does not occur in the collection used in Exercises 6.10
and 6.23. How would one compute ntc.atc scores for the query coyote insurance?
Solution
For the ntc weight, we compute the ntc weight of insurance as before.
For the atc weight of coyote there is nothing to compute: the ntc weight of coyote is
0 in every document, so the coyote component contributes nothing to any score.
Exercise 7.3
If we were to only have one-term queries, explain why the use of global champion
lists with r = K suffices for identifying the K highest scoring documents. What is a
simple modification to this idea if we were to only have s-term queries for any fixed
integer s > 1?
Solution
1. We take the union of the champion lists for each of the terms comprising the
query, and restrict cosine computation to only the documents in the union set. If
the query contains only one term, we just take the list with r = K, because there is
no need to compute the union.
2. For each term, keep a champion list of its r = K highest scoring documents, and
restrict the cosine computation to the union of the s champion lists.
Exercise 7.5
Consider again the data of Exercise 6.23 with nnn.atc for the query-dependent
scoring. Suppose that we were given static quality scores of 1 for Doc1 and 2 for
Doc2. Determine under Equation (7.2) what ranges of static quality score for Doc3
result in it being the first, second or third result for the query best car insurance.
Solution
Suppose the static quality score for Doc3 is g(doc3).
According to Exercise 6.23 and Equation 7.2, score(doc1, q) = 0.7627+ 1 = 1.7627,
score(doc2, q) = 0.6839 + 2 = 2.6839, score(doc3, q) = 0.9211 + g(doc3).
For Doc3 to be:
(1) the first: 0.9211 + g(doc3) > 2.6839, so g(doc3) > 1.7628
(2) the second: 1.7627 < 0.9211 + g(doc3) < 2.6839, so 0.8416 < g(doc3) < 1.7628
(3) the third: 0.9211 + g(doc3) < 1.7627, so 0 ≤ g(doc3) < 0.8416.
Exercise 7.8
The nearest-neighbor problem in the plane is the following: given a set of N data
points on the plane, we preprocess them into some data structure such that, given a
query point Q, we seek the data point that is closest to Q in Euclidean distance.
Clearly cluster pruning can be used as an approach to the nearest-neighbor problem in
the plane, if we wished to avoid computing the distance from Q to every one of the
data points. Devise a simple example on the plane so that with two leaders, the
answer returned by cluster pruning is incorrect (it is not the data point closest to Q).
Solution
(Sketch: two leaders with their clusters and the query point between them; the closest data point lies in the left cluster.)
As is shown in the above picture, the right leader is closer to the query point than the
left leader, but the closest point belongs to the left group.
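The counterexample can also be demonstrated in code. This is a minimal sketch of the cluster-pruning idea with invented one-dimensional points: the query is routed to the nearer leader, whose cluster does not contain the true nearest neighbour:

```python
import math

def cluster_pruning_nn(query, leaders, clusters):
    """Nearest neighbour via cluster pruning: pick the closest leader,
    then search only within that leader's cluster."""
    best_leader = min(leaders, key=lambda l: math.dist(query, l))
    return min(clusters[best_leader], key=lambda p: math.dist(query, p))

# Two leaders on a line: the right leader is nearer to the query,
# but the overall closest point sits in the left leader's cluster.
left, right = (0.0,), (10.0,)
clusters = {left: [left, (4.0,)], right: [right]}
q = (5.5,)

approx = cluster_pruning_nn(q, [left, right], clusters)
true_nn = min(clusters[left] + clusters[right], key=lambda p: math.dist(q, p))
print(approx, true_nn)  # cluster pruning returns (10.0,); true answer is (4.0,)
```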
Exercise 8.1
An IR system returns 8 relevant documents, and 10 nonrelevant documents. There
are a total of 20 relevant documents in the collection. What is the precision of the
system on this search, and what is its recall?
Solution
Precision = 8/18 = 0.44
Recall = 8/20 = 0.4
Exercise 8.3
Derive the equivalence between the two formulas for F measure shown in Equation
(8.5), given that α = 1/(β² + 1).
Solution
F = 1 / ( α/P + (1 − α)/R ) = PR / ( αR + (1 − α)P ).
Substituting α = 1/(β² + 1) and 1 − α = β²/(β² + 1) gives
F = PR / ( (R + β²P)/(β² + 1) ) = (β² + 1)PR / (β²P + R).
Exercise 8.9
The following list of Rs and Ns represents relevant (R) and nonrelevant (N) returned
documents in a ranked list of 20 documents retrieved in response to a query from a
collection of 10,000 documents. The top of the ranked list (the document the system
thinks is most likely to be relevant) is on the left of the list. This list shows 6 relevant
documents. Assume that there are 8 relevant documents in total in the collection.
R R N N N N N N R N R N N N R N N N N R
a. What is the precision of the system on the top 20?
b. What is the F1 on the top 20?
c. What is the uninterpolated precision of the system at 25% recall?
d. What is the interpolated precision at 33% recall?
e. Assume that these 20 documents are the complete result set of the system. What
is the MAP for the query?
Assume, now, instead, that the system returned the entire 10,000 documents in a
ranked list, and these are the first 20 results returned.
f. What is the largest possible MAP that this system could have?
g. What is the smallest possible MAP that this system could have?
h. In a set of experiments, only the top 20 results are evaluated by hand. The result
in (e) is used to approximate the range (f)–(g). For this example, how large (in
absolute terms) can the error for the MAP be by calculating (e) instead of (f) and
(g) for this query?
Solution
a. Precision = 6/20 = 0.3
b. Recall = 6/8 = 0.75, so F1 = 2PR/(P + R) = 2 · 0.3 · 0.75 / (0.3 + 0.75) = 3/7 ≈ 0.43
c. 8 · 0.25 = 2, so 25% recall means 2 relevant documents retrieved; while recall stays
at this level (ranks 2 through 8) the uninterpolated precision takes the values
1, 2/3, 2/4, 2/5, 2/6, 2/7, 2/8 = 1/4
d. Because the highest precision found for any recall level larger than 33% is 4/11 =
0.364, the interpolated precision at 33% recall is 4/11 = 0.364.
e. MAP = 1/6 · (1 + 1 + 3/9 + 4/11 + 5/15 + 6/20) ≈ 0.555
f. The largest MAP is obtained when the two remaining relevant documents appear at
ranks 21 and 22: MAP = 1/8 · (1 + 1 + 3/9 + 4/11 + 5/15 + 6/20 + 7/21 + 8/22) ≈ 0.503
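The MAP values in (e) and (f) can be reproduced with a short script. This is a sketch; the divisor follows the convention used in the solution above (6 relevant retrieved in (e), all 8 relevant in (f)):

```python
def average_precision(ranked, n_relevant):
    """Mean of the precision values at the ranks where relevant documents
    appear, divided by n_relevant; `ranked` is a 0/1 relevance list."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / n_relevant

# relevant documents at ranks 1, 2, 9, 11, 15, 20
ranked = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(round(average_precision(ranked, 6), 3))            # part (e) ≈ 0.555
print(round(average_precision(ranked + [1, 1], 8), 3))   # part (f) ≈ 0.503
```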
One possible condition is as follows:
α = 1 and (β/|Dr|) Σ_{d∈Dr} d = (γ/|Dnr|) Σ_{d∈Dnr} d, so that the positive and
negative feedback terms cancel.
No: if β is very small and γ is very large, the original query q0 might be closer to
the centroid of the relevant documents than the modified query qm.
Exercise 9.3 (book)
Suppose that a user’s initial query is cheap CDs cheap DVDs extremely cheap CDs.
The user examines two documents, d1 and d2. She judges d1, with the content CDs
cheap software cheap CDs relevant and d2 with content cheap thrills DVDs
nonrelevant. Assume that we are using direct term frequency (with no scaling and no
document frequency). There is no need to length-normalize vectors. Using Rocchio
relevance feedback as in Equation (9.3) what would the revised query vector be after
relevance feedback? Assume alpha= 1, beta= 0.75, gamma= 0.25.
Solution
cheap CDs DVDs extremely software thrills
q0 3 2 1 1 0 0
d1 2 2 0 0 1 0
d2 1 0 1 0 0 1
Using the Rocchio algorithm, qm = q0 + 0.75·d1 − 0.25·d2. Negative weights are set to 0.
qm 4.25 3.5 0.75 1 0.75 0
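The Rocchio update above can be reproduced in a few lines (term order cheap, CDs, DVDs, extremely, software, thrills, as in the table):

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio relevance feedback using the centroids of the relevant and
    nonrelevant documents; negative weights are clipped to 0."""
    def centroid(docs):
        n = len(docs)
        return [sum(xs) / n for xs in zip(*docs)] if docs else [0.0] * len(q0)
    cr, cn = centroid(relevant), centroid(nonrelevant)
    return [max(0.0, alpha * q + beta * r - gamma * s)
            for q, r, s in zip(q0, cr, cn)]

q0 = [3, 2, 1, 1, 0, 0]   # cheap CDs cheap DVDs extremely cheap CDs
d1 = [2, 2, 0, 0, 1, 0]   # relevant
d2 = [1, 0, 1, 0, 0, 1]   # nonrelevant
print(rocchio(q0, [d1], [d2]))  # -> [4.25, 3.5, 0.75, 1.0, 0.75, 0.0]
```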
Exercise 9.7
If A is simply a Boolean ”incidence” matrix, then what do you get as the entries in
C?
Solution
Cu,v is the number of documents containing both term u and term v.
In the Boolean incidence matrix, At,d = 1 means that term t appears in document d.
Since C = AAᵀ, Cu,v increases by one for each document in which both terms appear:
if Au,d = 1 and Av,d = 1, then document d contributes 1 to Cu,v.
Exercise 11.3
Let Xt be a random variable indicating whether the term t appears in a document.
Suppose we have |R| relevant documents in the document collection and that Xt = 1
in s of the documents. Take the observed data to be just these observations of Xt for
each document in R. Show that the MLE for the parameter pt = P(Xt= 1|R = 1,q),
that is, the value for pt which maximizes the probability of the observed data, is
pt = s/|R|.
Solution
P(D | pt) = pt^s · (1 − pt)^(|R| − s), where s is the number of relevant documents
containing the term and |R| is the total number of relevant documents.
The log-likelihood: log P(D | pt) = s log pt + (|R| − s) log(1 − pt).
Setting the partial derivative to zero: s/pt − (|R| − s)/(1 − pt) = 0.
We get pt·|R| = s, so pt = s/|R|.
Exercise 12.6
Consider making a language model from the following training text:
The martian has landed on the latin pop sensation ricky martin
a. Under a MLE-estimated unigram probability model, what are P(the) and
P(martian)?
b. Under a MLE-estimated bigram model, what are P(sensation|pop) and P(pop|the)?
Solution
a. P(the) = 2/11, P(martian) = 1/11
b. P(sensation|pop) = 1, P(pop|the) = 0
Exercise 12.7
Suppose we have a collection that consists of the 4 documents given in the below
table.
docID Document text
1 click go the shears boys click click click
2 click click
3 metal here
4 metal shears click here
Build a query likelihood language model for this document collection. Assume a
mixture model between the documents and the collection, with both weighted at 0.5.
Maximum likelihood estimation (mle) is used to estimate both as unigram models.
Work out the model probabilities of the queries click, shears, and hence click shears
for
each document, and use those probabilities to rank the documents returned by each
query. Fill in these probabilities in the below table:
Query Doc 1 Doc 2 Doc 3 Doc 4
click
shears
click shears
What is the final ranking of the documents for the query click shears?
Solution
Language models
            click  go    the   shears  boys  metal  here
model1      1/2    1/8   1/8   1/8     1/8   0      0
model2      1      0     0     0       0     0      0
model3      0      0     0     0       0     1/2    1/2
model4      1/4    0     0     1/4     0     1/4    1/4
collection  7/16   1/16  1/16  2/16    1/16  2/16   2/16
Final ranking for the query click shears: doc4 > doc1 > doc2 > doc3
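The mixture-model probabilities requested in the table can be computed directly from the language models above. A minimal sketch, using equal weights of 0.5 for the document and collection models as the exercise specifies:

```python
collection = {
    1: "click go the shears boys click click click".split(),
    2: "click click".split(),
    3: "metal here".split(),
    4: "metal shears click here".split(),
}
all_tokens = [t for doc in collection.values() for t in doc]

def p_mle(term, tokens):
    """Maximum likelihood unigram estimate."""
    return tokens.count(term) / len(tokens)

def query_likelihood(query, doc_tokens, lam=0.5):
    """Mixture of the document and collection unigram models."""
    p = 1.0
    for term in query:
        p *= lam * p_mle(term, doc_tokens) + (1 - lam) * p_mle(term, all_tokens)
    return p

scores = {d: query_likelihood(["click", "shears"], toks)
          for d, toks in collection.items()}
print(sorted(scores, key=scores.get, reverse=True))  # -> [4, 1, 2, 3]
```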
Web Mining
Exercises
Mauro Brunato, Elisa Cilia
Exercise 1
A corpus contains the following five documents:
d1 To be or not to be, this is the question!
d2 I have a pair of problems for you to solve today.
d3 It’s a long way to Tipperary, it’s a long way to go. . .
d4 I’ve been walking a long way to be here with you today.
d5 I am not able to question these orders.
The indexing system only considers nouns, adjectives, pronouns, adverbs and verbs. All forms are converted to
the singular, verbs are converted to the infinitive, all punctuation marks are removed and all letters are translated to
uppercase. Conjunctions, prepositions, articles and exclamations are discarded as well. Multiple occurrences of
the same term within a document are not counted.
For instance, the phrase
Hey, it’s not too late to solve these exercises!
becomes
IT BE NOT TOO LATE SOLVE THIS EXERCISE
1.1) What is the minimum dimension (number of coordinates) of the TFIDF vector space for this collection of
documents?
1.2) Fill the 5 × 5 matrix of Jaccard coefficients between all pairs of documents.
1.3) Apply an agglomerative clustering procedure to the collection. as a measure of similarity between two
clusters D1 and D2 , consider the highest similarity between d1 and d2 , with d1 ∈ D1 and d2 ∈ D2 .
1.4) Draw the resulting dendrogram.
Solution — The stripped-down documents are the following (the third column counts the number of distinct terms
in each document, to ease the calculation of the Jaccard coefficients):
d1 BE NOT THIS QUESTION (4)
d2 I HAVE PAIR PROBLEM YOU SOLVE TODAY (7)
d3 IT BE LONG WAY TIPPERARY GO (6)
d4 I HAVE BE WALK LONG WAY HERE YOU TODAY (9)
d5 I BE NOT ABLE QUESTION THIS ORDER (7)
1.1) The collection includes 20 different terms: ABLE, BE, GO, HAVE, I, IT, HERE, LONG, NOT, ORDER, PAIR,
PROBLEM, QUESTION, SOLVE, THIS, TIPPERARY, TODAY, WALK, WAY, and YOU. Therefore, the vector
representation requires at least 20 dimensions.
1.2) The table of Jaccard coefficients is the following. Only the upper triangular part is shown, since the Jaccard
coefficient is symmetrical.
d1 d2 d3 d4 d5
d1 1 0 1/9 1/12 4/7
d2 1 0 1/3 1/13
d3 1 1/4 1/12
d4 1 1/7
d5 1
1.3) The two most similar documents are d1 and d5 , so they can be joined in the same partition. The similarity matrix
becomes:
{d1 , d5 } {d2 } {d3 } {d4 }
{d1 , d5 } 1 1/13 1/9 1/7
{d2 } 1 0 1/3
{d3 } 1 1/4
{d4 } 1
After this step, singletons {d2 } and {d4 } are the most similar (1/3) and are joined; next, {d2 , d4 } and {d3 }
are merged (similarity max(0, 1/4) = 1/4):
{d1 , d5 } {d2 , d3 , d4 }
{d1 , d5 } 1 1/7
{d2 , d3 , d4 } 1
Finally, the two remaining clusters can be merged together. The corresponding dendrogram is the following:
(Dendrogram with leaves in the order d1, d5, d2, d4, d3: {d1, d5} join at similarity 4/7, {d2, d4} at 1/3, then d3 at 1/4, and the final merge at 1/7.)
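The whole single-link procedure can be reproduced programmatically. A sketch using the stripped-down documents as term sets (the preprocessing assumed here follows Exercise 1):

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def single_link_merge_order(docs):
    """Agglomerative clustering where cluster similarity is the maximum
    pairwise document similarity (single link). Returns the merge history."""
    clusters = [frozenset([name]) for name in docs]
    history = []
    while len(clusters) > 1:
        best = max(((c1, c2) for i, c1 in enumerate(clusters)
                    for c2 in clusters[i + 1:]),
                   key=lambda pair: max(jaccard(docs[a], docs[b])
                                        for a in pair[0] for b in pair[1]))
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
        history.append(sorted(best[0] | best[1]))
    return history

docs = {
    "d1": {"BE", "NOT", "THIS", "QUESTION"},
    "d2": {"I", "HAVE", "PAIR", "PROBLEM", "YOU", "SOLVE", "TODAY"},
    "d3": {"IT", "BE", "LONG", "WAY", "TIPPERARY", "GO"},
    "d4": {"I", "HAVE", "BE", "WALK", "LONG", "WAY", "HERE", "YOU", "TODAY"},
    "d5": {"I", "BE", "NOT", "ABLE", "QUESTION", "THIS", "ORDER"},
}
print(single_link_merge_order(docs))
# -> [['d1', 'd5'], ['d2', 'd4'], ['d2', 'd3', 'd4'], ['d1', 'd2', 'd3', 'd4', 'd5']]
```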
Exercise 2
In the same setting as in the previous exercise, estimate the Jaccard coefficient for all document pairs based on
the application of five random permutations.
Exercise 3
Let D be a set of documents over the set T of terms, where ntd counts the number of occurrences of term t in
document d.
3.1) Consider the following term frequency measures:
A1(t, d) = ntd,   A2(t, d) = 1 if ntd ≠ 0 and 0 otherwise,   A3(t, d) = ntd / |d|,   A4(t, d) = log(1 + ntd),
and the aggregate measures
B3(t) = ( Σ_{d∈D} 1/(1 + A1(t, d)) )⁻¹,   B4(d) = Σ_{t∈T} A4(t, d).
Exercise 4
A document retrieval system must be implemented in a structured programming language (Java, C, C++).
Documents and terms are represented with their numeric IDs.
4.1) Define the appropriate array and record structures to efficiently store the matrix ntd counting the number of
occurrences of each term t in each document d, considering that it is very sparse. Define the structure to store
inverse document frequency values.
4.2) Write a function retrieve(q) which, given the array q of term indices, returns an array with the IDs of
the five nearest documents according to the cosine measure in the TFIDF space.
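One way to approach 4.1 and 4.2 can be sketched as follows. The exercise asks for Java/C/C++; this sketch uses Python for brevity, with a dictionary-of-dictionaries standing in for the sparse array-of-records structure, and document norms recomputed on the fly rather than precomputed:

```python
import math
from collections import defaultdict

# Sparse term-document counts: n[t][d] = occurrences of term t in document d.
n = defaultdict(dict)

def add_occurrence(t, d, count=1):
    n[t][d] = n[t].get(d, 0) + count

def idf(t, num_docs):
    """Inverse document frequency; df(t) is the number of entries in row t."""
    return math.log(num_docs / len(n[t])) if n[t] else 0.0

def retrieve(q, num_docs, k=5):
    """IDs of the k documents nearest to query q (a list of term IDs)
    under the cosine measure in TFIDF space."""
    scores, norms = defaultdict(float), defaultdict(float)
    for t in n:
        w = idf(t, num_docs)
        for d, cnt in n[t].items():
            norms[d] += (cnt * w) ** 2
            if t in q:
                scores[d] += cnt * w * w   # query weight for t is idf(t)
    ranked = sorted(scores,
                    key=lambda d: scores[d] / math.sqrt(norms[d]),
                    reverse=True)
    return ranked[:k]

# tiny demo corpus: doc 0 = {t1 x2, t2}, doc 1 = {t2 x3}, doc 2 = {t1, t3}
for t, d, c in [(1, 0, 2), (2, 0, 1), (2, 1, 3), (1, 2, 1), (3, 2, 1)]:
    add_occurrence(t, d, c)
print(retrieve([1], num_docs=3))  # -> [0, 2]
```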
Exercise 5
An information retrieval system manages a corpus of six documents. Given the query q, the system computes
the following probabilities for the documents to be relevant:
i 1 2 3 4 5 6
pi 100% 80% 20% 80% 0 100%
5.1) What strategy can the system adopt in order to maximize its recall score? What strategy can maximize its
precision score?
5.2) Suppose that the only documents that are relevant with respect to query q are 1, 2, 4 and 6 (of course, the
system does not know this). The system implements two alternative algorithms:
1. let document i appear in the returned list iff pi = 100%, or
2. let document i appear in the list with probability pi .
Compute the expected values of precision and recall assigned by the user (who knows the actual document
relevance) to the list of documents returned by each algorithm.
Hint — Note that algorithm (1) is deterministic, only algorithm (2) is stochastic.
Solution —
5.1) Let r = (ri ), where ri is the “true” relevance of document i (remember that the query is fixed). Let x = (xi ),
where xi = 1 iff the IR system returns document i in response to the query. Then,
Precision_r(x) = x·r / Σ_{i=1}^{6} xi ,   Recall_r(x) = x·r / Σ_{i=1}^{6} ri .
In other words, the “precision” of the answer is the fraction of relevant documents within the list provided by the IR system.
Its maximum value is attained when all returned documents are relevant, so we need to return only the two documents, 1
and 6, which are certainly relevant to the user. The “recall” of the answer is its property of containing as many relevant
documents as possible, and it is maximized by returning all documents (with the possible exception of 5, which is irrelevant
for sure).
5.2) In the first case, the IR system provides a deterministic answer, having precision 100% and recall 50%. In the
second case, we need to compute precision and recall scores for all possible return strings, and compute their probability-
weighted average:
X X
E(Precision) = Pr(x) Precisionr (x), E(Recall) = Pr(x) Recallr (x).
x x
Note that documents 1 and 6 are always returned, while document 5 is never returned; moreover, documents 2 and 4 are
indistinguishable, so we can determine the following table, where precision (left) and recall (right) scores are provided
together with their probabilities (in parentheses).
              x2 + x4 = 0 (.04)         x2 + x4 = 1 (.32)         x2 + x4 = 2 (.64)
x3 = 0 (.8)   P = 2/2, R = 2/4 (.032)   P = 3/3, R = 3/4 (.256)   P = 4/4, R = 4/4 (.512)
x3 = 1 (.2)   P = 2/3, R = 2/4 (.008)   P = 3/4, R = 3/4 (.064)   P = 4/5, R = 4/4 (.128)
Finally,
E(Precision) = .8 + (2/3)·.008 + (3/4)·.064 + (4/5)·.128 ≈ .8 + .005 + .048 + .102 ≈ 96%,
E(Recall) = (2/4)·.04 + (3/4)·.32 + (4/4)·.64 = .02 + .24 + .64 = 90%.
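The probability-weighted averages can be verified by brute-force enumeration over all possible returned sets:

```python
from itertools import product

p = [1.0, 0.8, 0.2, 0.8, 0.0, 1.0]   # P(document i is returned)
relevant = [1, 1, 0, 1, 0, 1]        # true relevance (documents 1, 2, 4, 6)

exp_prec = exp_rec = 0.0
for x in product([0, 1], repeat=6):
    prob = 1.0
    for xi, pi in zip(x, p):
        prob *= pi if xi else 1 - pi
    if prob == 0.0:
        continue                      # impossible outcome
    tp = sum(xi and ri for xi, ri in zip(x, relevant))
    exp_prec += prob * (tp / sum(x))
    exp_rec += prob * (tp / sum(relevant))
print(round(exp_prec, 3), round(exp_rec, 3))  # -> 0.956 0.9
```

The exact expected precision is ≈ 0.9557, which the solution above rounds to 96%.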
Exercise 6
With the same data of Exercise 5, suppose that the system uses algorithm (1).
6.1) Compute the expected precision and recall scores from the point of view of the IR system, who only knows
the probabilities pi for document i to be relevant.
Solution — In this case the IR system’s answer is known, but the actual document relevance is a random variable with
the given probabilities. Therefore, the average values must be computed against probabilities of the unknown r:
Er(Precision) = Σ_r Pr(r) Precision_r(x) ,   Er(Recall) = Σ_r Pr(r) Recall_r(x) .
We know the answer x of the IR system, which is (1, 0, 0, 0, 0, 1), therefore we can compute a table which is similar to that
of Exercise 5:
              r2 + r4 = 0 (.04)         r2 + r4 = 1 (.32)         r2 + r4 = 2 (.64)
r3 = 0 (.8)   P = 2/2, R = 2/2 (.032)   P = 2/2, R = 2/3 (.256)   P = 2/2, R = 2/4 (.512)
r3 = 1 (.2)   P = 2/2, R = 2/3 (.008)   P = 2/2, R = 2/4 (.064)   P = 2/2, R = 2/5 (.128)
Therefore, as expected,
Er (Precision) = 100%
because we are sure that only relevant documents are returned. On the other hand,
Er (Recall) = (2/2)·.032 + (2/3)·.256 + (2/4)·.512 + (2/3)·.008 + (2/4)·.064 + (2/5)·.128
≈ .032 + .171 + .256 + .005 + .032 + .051 ≈ 55%.
Exercise 7
Write in your favorite high-level language a function that implements the FastMap algorithm. In particular,
define what input must be provided and which output shall be returned.
Solution — Let matrix d be the input data (mutual distances between pairs of items). The matrix is symmetric,
so many optimizations are possible. Let x be the output matrix with one column per document and one row per extracted
coordinate. We assume that the number of documents n and the number of extracted dimensions m are encoded into matrix
sizes; otherwise, we can pass them as two additional integer parameters.
Note that the term within the square root sign at line 10 might be negative, so a bit of care must be taken when actually
implementing the algorithm. . .
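The referenced listing is not reproduced here, so the following is an independent sketch of the same algorithm, with the negative-square-root case handled by clamping at zero as the note suggests. The farthest-pair pivot heuristic and the list-of-lists representation are implementation choices of this sketch:

```python
import math

def fastmap(d, m):
    """FastMap sketch: d is an n x n symmetric distance matrix (list of
    lists), m the number of coordinates to extract.
    Returns an m x n coordinate matrix."""
    n = len(d)
    d = [row[:] for row in d]               # local working copy
    x = []
    for _ in range(m):
        # pivot heuristic: start anywhere, jump to the farthest object twice
        a = 0
        for _ in range(2):
            b, a = a, max(range(n), key=lambda i: d[a][i])
        if d[a][b] == 0:                    # all residual distances are zero
            x.append([0.0] * n)
            continue
        dab = d[a][b]
        coords = [(d[a][i] ** 2 + dab ** 2 - d[b][i] ** 2) / (2 * dab)
                  for i in range(n)]
        x.append(coords)
        # residual distances; clamp at 0 in case the term under the
        # square root sign turns out negative
        d = [[math.sqrt(max(0.0, d[i][j] ** 2 - (coords[i] - coords[j]) ** 2))
              for j in range(n)] for i in range(n)]
    return x

d = [[0, 3, 7], [3, 0, 4], [7, 4, 0]]   # three collinear points at 0, 3, 7
print(fastmap(d, 1))  # -> [[0.0, 3.0, 7.0]]
```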
Exercise 8
The columns of the following matrix represent the coordinates of a set of documents in a TFIDF space:
A = (1/√6) ·
⎡ 2  0  2 ⎤
⎢ 0  1  0 ⎥
⎢ 2  1  2 ⎥
⎣ 2 −1  2 ⎦
Solution —
8.1) Notice that A has two linearly dependent (actually equal) columns (thus rk A < 3), while the first two columns are
independent (thus rk A ≥ 2), therefore rk A = 2.
8.2) Similarities are computed by dot products, let’s do it in a single shot for all documents:
Aᵀ q = (1/√6) · (−2, 5, −2)ᵀ ;
Similarity to the documents is computed via the V Σ2 matrix. If all computations are right,
V Σ2 q̂ = AT q.
Exercise 9
Specify in the MapReduce framework the Map and Reduce functions to find the number of occurrences of
one/more given pattern/s in a collection of documents.
Function Map receives a key (related to the document ID or line offset), which we can disregard, and a sequence of terms
(a line or a full document). It gives as output a list of pairs (match, 1), one for each match of the pattern in the received
value.
Function Reduce takes as input a pair (match, [n1 , . . . , nk ]) where the value part is a list of previously computed
occurrences (originally all 1’s) and returns the list of matching patterns (only one element in this case) with the number of
occurrences for each match.
The pseudo-code for the Map and Reduce functions is the following:
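A runnable sketch of these two functions, with the framework's grouping (shuffle) phase simulated in plain Python; the toy documents and patterns are invented for the demonstration:

```python
from collections import defaultdict

def map_fn(_key, text, patterns):
    """Emit a (pattern, 1) pair for every occurrence of a pattern in text;
    the document key is received but not used."""
    return [(p, 1) for p in patterns for _ in range(text.count(p))]

def reduce_fn(match, counts):
    """Sum the partial counts collected for one pattern."""
    return (match, sum(counts))

# simulate the framework's shuffle phase on a toy collection
docs = {1: "to be or not to be", 2: "to do is to be"}
patterns = ["to be", "to do"]
groups = defaultdict(list)
for key, text in docs.items():
    for match, one in map_fn(key, text, patterns):
        groups[match].append(one)
result = dict(reduce_fn(m, c) for m, c in groups.items())
print(result)  # -> {'to be': 3, 'to do': 1}
```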
Exercise 10
d1 , d2 , d3 , d4 , d5 , d6
Solution — 10.1) We are given a ranked list of documents returned in response to a query q with their associated relevance
values. In this ranked retrieval context, precision and recall can be computed by considering as the set of retrieved documents
the top k ranked documents:
k rk R P
0 0 0 1
1 0 0 0
2 1 0.25 0.5
3 1 0.5 0.66
4 0 0.5 0.5
5 1 0.75 0.6
6 1 1 0.66
The interpolated precision Pinterp at a certain level ρ of recall is defined as the highest precision found for any recall level
ρ′ ≥ ρ:

Pinterp(ρ) = max_{ρ′ ≥ ρ} P(ρ′)

Thus the interpolated precision at recall level ρ = 0.5 is Pinterp(0.5) = 0.66.
10.2) The F1 measure, F1 = 2PR / (P + R), computed when all the documents are returned (taking the P and R values
from the last row of the table) is F1 = 2 × 0.66 × 1 / 1.66 ≈ 0.795 (exactly 0.8 with P = 2/3).
10.3) To find the BEP we plot the interpolated precision curve; the break-even point, where precision equals recall, is
BEP = 0.66.

[Figure: interpolated precision versus recall, with the break-even point marked at 0.66]
Exercise 11
The singular value decomposition of a term-document matrix A = U Σ V^T is

U = (1/√2) ·
[ 1 −1  0 ]
[ 1  1  0 ]
[ 0  0 √2 ]

Σ =
[ 3 0 0 ]
[ 0 2 0 ]
[ 0 0 1 ]

V = (1/√3) ·
[ 1  0 −1 ]
[ 1  1  1 ]
[ 0  1 −1 ]
[ 1 −1  0 ]
11.1) What is the rank of the matrix A?
11.2) Perform a reduction of the LSI space by one dimension.
11.3) Given the new representation of matrix Â, apply an agglomerative clustering procedure to the collection.
Merge the clusters at each step according to the self-similarity measure, by using as a measure of inter-document
similarity simply the dot-product ⟨d1, d2⟩.
11.4) Draw the resulting dendrogram. How many clusters can you find at a level of similarity of 2?
11.5) Check the clustering results you get by cutting across the dendrogram, by plotting them.
Solution —
11.1) Matrix A is a 3 × 4 matrix. All three elements on the diagonal of Σ are non-null, therefore matrix
A has the maximum possible rank: rk A = 3.
11.2) Let us obtain Û, Σ̂ and V̂ by removing the third column from U and V, and the third row and column from Σ,
corresponding to the smallest eigenvalue of A^T A:

Û = (1/√2) ·
[ 1 −1 ]
[ 1  1 ]
[ 0  0 ]

Σ̂ =
[ 3 0 ]
[ 0 2 ]

V̂ = (1/√3) ·
[ 1  0 ]
[ 1  1 ]
[ 0  1 ]
[ 1 −1 ]
11.3) The pairwise document similarities (dot products in the reduced space) are:

        2     3      4
  1     3     0      3
  2          4/3    5/3
  3                −4/3
11.4) {1, 2} and {1, 4} are both candidates as the first cluster. Let us choose the first pair. Therefore, at level 3 the first
clustering step yields

           3      4
  {1,2}  13/9   23/9
  3             −4/3

Now, the highest self-similarity value is achieved by cluster {1, 2, 4} at level 23/9, so that the similarity matrix becomes

              3
  {1,2,4}   23/18

The last merge, {1, 2, 3, 4}, therefore happens at level 23/18. Cutting the dendrogram at similarity level 2
(between 23/18 ≈ 1.28 and 23/9 ≈ 2.56) yields two clusters: {1, 2, 4} and {3}.
Exercise 12
Consider the query “love Mary” and the set of documents returned by an information retrieval system:
d1: John gives a book to Mary.
d2: John who reads a book loves Mary.
d3: Whom does John think Mary loves?
d4: John thinks a book is a good gift.
12.1) Define the set of keywords and give documents the corresponding representation, after applying the stop
word elimination and the stemming processes. Assume that we are using direct term frequency (with no scaling
and no document frequency). Do not normalize vectors.
12.2) Suppose that document 2 has been judged as relevant, and document 4 as not-relevant. Using Rocchio
relevance feedback what would the revised query vector be after relevance feedback?
Assume α = 1, β = 0.5, γ = 0.5.
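For 12.2, the Rocchio update q′ = α·q + β·centroid(Dr) − γ·centroid(Dnr) can be sketched as follows; the vectors in the example are made up for illustration, not the keyword vectors of 12.1.

```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.5):
    """Rocchio update: q' = alpha*q + beta*centroid(Dr) - gamma*centroid(Dnr)."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(q))]
    cr, cn = centroid(relevant), centroid(nonrelevant)
    return [alpha * q[i] + beta * cr[i] - gamma * cn[i] for i in range(len(q))]

# Toy illustration with made-up 3-term vectors:
q = [1.0, 1.0, 0.0]
print(rocchio(q, [[0.0, 2.0, 1.0]], [[2.0, 0.0, 0.0]]))
# → [0.0, 2.0, 0.5]
```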
Exercise 13
A large set of documents, each containing a large number of terms, is given. The aim of this exercise is to create
an index that maps each term to the document where it occurs in the earliest position (ties may be broken at
will). For example, given the three following documents:
Filename Content
random.doc Zigzag bumblebee slash acorn
nonsense.txt Bumblebee acorn zigzag slash
useless.pdf Acorn dot bumblebee slash zigzag
the index should be:
acorn ↦ useless.pdf
bumblebee ↦ nonsense.txt
dot ↦ useless.pdf
slash ↦ random.doc
zigzag ↦ random.doc
In fact, the word “bumblebee” appears in position 2 of file random.doc, in position 1 of file nonsense.txt
and in position 3 of file useless.pdf, therefore it is mapped to nonsense.txt.
13.1) Outline a MapReduce-based solution to the problem. In particular, describe the input and output records
of the mapper and reducer functions.
13.2) Could a combiner be useful? Provide a short motivation to your answer.
13.3) Implement the mapper and the reducer; assume that the data are already tokenized and use any language
or high-level pseudo-code.
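For 13.3, a possible Python sketch; the shuffle phase is simulated in the driver, and the data are assumed already tokenized as the exercise allows.

```python
from collections import defaultdict

def map_fn(filename, tokens):
    # Emit (term, (position, filename)), keeping only the earliest
    # position of each term within this document.
    earliest = {}
    for pos, term in enumerate(tokens, start=1):
        t = term.lower()
        if t not in earliest:
            earliest[t] = (pos, filename)
    return list(earliest.items())

def reduce_fn(term, values):
    # values: list of (position, filename); keep the globally smallest
    # position (ties broken by filename order, which the text allows).
    return (term, min(values)[1])

docs = {
    "random.doc": "Zigzag bumblebee slash acorn".split(),
    "nonsense.txt": "Bumblebee acorn zigzag slash".split(),
    "useless.pdf": "Acorn dot bumblebee slash zigzag".split(),
}
groups = defaultdict(list)
for fname, tokens in docs.items():
    for term, value in map_fn(fname, tokens):
        groups[term].append(value)
index = {t: reduce_fn(t, vs)[1] for t, vs in sorted(groups.items())}
print(index)
```

Running it on the three example documents reproduces the index shown above.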
Exercise 14
A directed graph is a pair G = (V, E), where V, the vertex set, is a finite set of terms, and E ⊆ V × V is the
edge set. The indegree of a vertex v ∈ V is the number of incoming edges (the cardinality of E ∩ (V × {v})),
while its outdegree is the number of outgoing edges (the cardinality of E ∩ ({v} × V )).
An edge in E can be represented by a line of text containing two terms (the first is the origin, the second the
destination of the edge). The edge set E is therefore represented by a collection of lines of text.
Given a collection of lines describing the edge set E (either coming from a single file or split among several
files), we want to design a MapReduce system to produce a list of vertices, each associated with a pair of integers
representing their indegree and outdegree.
For example, consider the set of edge lines on the left. The resulting mapping is shown on the right.

lorem ipsum           lorem ↦ (0, 2)
lorem dolor           ipsum ↦ (2, 1)
ipsum sit        ⇒    dolor ↦ (1, 2)
dolor sit             sit ↦ (2, 0)
dolor amet            amet ↦ (1, 1)
amet ipsum
14.1) What are the domain and codomain of the Map and Reduce functions? Is it possible to use the Reduce
function as a combiner as well?
14.2) Write a pseudo-code implementation of the relevant functions.
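For 14.2, a possible Python sketch; the edge lines in the driver are made up for illustration, not those of the example above.

```python
from collections import defaultdict

def map_fn(_, line):
    src, dst = line.split()
    # Each edge contributes one outgoing edge to src and one incoming
    # edge to dst: values are (indegree, outdegree) contributions.
    return [(src, (0, 1)), (dst, (1, 0))]

def reduce_fn(vertex, values):
    # Sum the (indegree, outdegree) contributions component-wise.
    return (vertex, (sum(i for i, o in values), sum(o for i, o in values)))

# Hypothetical edge lines:
lines = ["a b", "a c", "b c"]
groups = defaultdict(list)
for line in lines:
    for vertex, contrib in map_fn(None, line):
        groups[vertex].append(contrib)
degrees = {v: reduce_fn(v, c)[1] for v, c in groups.items()}
print(degrees)   # → {'a': (0, 2), 'b': (1, 1), 'c': (2, 0)}
```

Since the per-vertex sums are associative and the reducer's output value has the same shape as the mapper's output values, the same function can also serve as a combiner, which is the point of question 14.1.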
Exercise 15
Below is a table showing how a human judge rated the relevance of a set of 15 documents with respect to a
particular information need (0 = nonrelevant, 1 = relevant).
Let us assume that two different information retrieval engines (S1 and S2) compute for this query the following
rankings:
S1: (5, 8, 9, 1, 3, 4, 2, 10, 12, 13, 15, 6, 7, 14, 11) and S2: (7, 8, 1, 10, 12, 2, 3, 5, 13, 15, 14, 4, 6, 9, 11).
15.1) Does intuition suggest that one of the two IR systems is better than the other?
Show that your guess is right by analysing the performance of the two systems and by comparing them.
Use the most suitable performance measures and methods among those we have seen during the course, giving
as much evidence as you can.
15.2) Suppose the IR engine can return only the first 8 documents to the user. Compare the performance of the
systems in this case.
Which IR system is the best? Justify your answer.
Exercise 16
Consider the following set of documents, where the vocabulary is composed of three words and we have two
categories A and B:
(5, 6, 0) ↦ A
(2, 1, 3) ↦ A
(7, 7, 0) ↦ A
(2, 2, 5) ↦ A
(0, 8, 4) ↦ B
(2, 0, 8) ↦ B
(7, 1, 3) ↦ B
16.1) Perform one iteration of the k-means algorithm assuming that the initial clustering corresponds to the
provided categorization. Show the final clustering.
16.2) Suppose that the document set is used as a training set for a supervised k-Nearest-Neighbors classifier.
Given the new document (4, 4, 1), how would the classifier categorize it for k = 1? What for k = 3?
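One k-means iteration as asked in 16.1 can be sketched as follows (squared Euclidean distance is assumed; ties, which do not occur here, would go to the first cluster in sorted order).

```python
def kmeans_step(docs, assign):
    # One k-means iteration: recompute centroids from the current
    # assignment, then reassign each document to the nearest centroid.
    clusters = sorted(set(assign))
    cents = {}
    for c in clusters:
        members = [d for d, a in zip(docs, assign) if a == c]
        cents[c] = tuple(sum(x) / len(members) for x in zip(*members))
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(clusters, key=lambda c: d2(d, cents[c])) for d in docs]

docs = [(5, 6, 0), (2, 1, 3), (7, 7, 0), (2, 2, 5),
        (0, 8, 4), (2, 0, 8), (7, 1, 3)]
print(kmeans_step(docs, ["A", "A", "A", "A", "B", "B", "B"]))
# → ['A', 'B', 'A', 'B', 'B', 'B', 'A']
```

The centroids are (4, 4, 2) for A and (3, 3, 5) for B, so three documents change cluster in this iteration.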
Exercise 17
Below is a table showing how two human judges rated the relevance of a set of 12 documents to a particular
information need (0 = nonrelevant, 1 = relevant).
Let us assume that you have written an information retrieval engine that for this query returns the set of documents {4, 5, 6, 7, 8}.
17.1) Calculate the precision and recall of your system if a document is considered relevant only if both judges
agree.
17.2) Calculate the precision and recall of your system if a document is considered relevant if either judge thinks
it is relevant.
17.3) Suppose the documents are returned by your IR engine in the ID order as in the table.
a. Plot the Precision versus Recall graph for the first case (a document is considered relevant only if both
judges agree) when varying the number of documents returned (1 document returned, 2 documents returned, etc.).
b. Determine the interpolated precision at level ρ = 0.5 of recall.
c. How many documents should the system return in order to maximize its performance? Justify your
answer.
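The two agreement policies of 17.1 and 17.2 amount to intersecting or uniting the judges' relevant sets; a small sketch with made-up judgments, since the table of the exercise is not reproduced here.

```python
def precision_recall(retrieved, relevant):
    # Precision and recall of a retrieved set against a relevant set.
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    p = tp / len(retrieved) if retrieved else 1.0
    r = tp / len(relevant) if relevant else 1.0
    return p, r

# Hypothetical judgments (NOT the table of the exercise):
judge1 = {1, 4, 5, 8, 9}
judge2 = {4, 6, 8, 10}
retrieved = {4, 5, 6, 7, 8}
both = judge1 & judge2      # 17.1: relevant only if both judges agree
either = judge1 | judge2    # 17.2: relevant if either judge says so
print(precision_recall(retrieved, both))    # → (0.4, 1.0)
print(precision_recall(retrieved, either))  # → (0.8, 0.571...)
```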
Exercise 18
The network of references for a set of five hypertexts is given in figure:
[Figure: reference network over nodes 1–5]
Compute the first 5 iterations of the PageRank and HITS algorithms in the following hypotheses:
• No damping factor.
• Initial PageRank vector gives probability 1 to node 1.
• Initial hub and authority vectors are uniformly 1 over all nodes.
• No normalization required.
Exercise 19
The network of references for a set of four hypertexts is given in figure:
[Figure: reference network over nodes 1–4]
19.1) Execute the first four steps of the PageRank algorithm starting from user being with certainty at node 1
(no damping factor).
19.2) Compute the stationary PageRank scores of the documents.
Exercise 20
Suppose that a query, executed on the same network as Exercise 19, returns nodes 1 and 2 as the root set, and
that we want to use the HITS algorithm in order to rank the pages.
20.1) Define the expanded set and the base set for the given query.
20.2) Compute the first five iterations of the HITS algorithm for the base set.
20.3) Which hub and authority values will asymptotically dominate?
Exercise 21
Let a hypertext system be a complete bipartite graph with 3 hubs and 2 authorities.
21.1) For every node in the system, draw a link from the node to itself. Write the adjacency matrix of the system,
and normalize it for use with the PageRank algorithm.
21.2) What is the PageRank score of the nodes in the system? Provide both an analytical proof and an intuitive
explanation. Assume no damping factor.
21.3) Now add a link from one of the authorities to one of the hubs. What is the PageRank score of the nodes
now? Provide both an analytical proof and an intuitive explanation.
Solution — The stationary PageRank vector v satisfies

L^T v = v

corresponding to the following adjacency matrix (on the right, the normalized version):

     [ 1 0 0 1 1 ]         [ 1/3  0   0  1/3 1/3 ]
     [ 0 1 0 1 1 ]         [  0  1/3  0  1/3 1/3 ]
E′ = [ 0 0 1 1 1 ]    L′ = [  0   0  1/3 1/3 1/3 ]
     [ 1 0 0 1 0 ]         [ 1/2  0   0  1/2  0  ]
     [ 0 0 0 0 1 ]         [  0   0   0   0   1  ]
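A quick numerical check of the 21.3 situation by power iteration on a row-normalized transition matrix of the shape above (the starting distribution is an incidental choice): node 5, a rank sink with only a self-loop leaving it, ends up absorbing essentially all the score.

```python
# Power iteration v <- L^T v, starting from the uniform distribution.
L = [
    [1/3, 0,   0,   1/3, 1/3],
    [0,   1/3, 0,   1/3, 1/3],
    [0,   0,   1/3, 1/3, 1/3],
    [1/2, 0,   0,   1/2, 0  ],
    [0,   0,   0,   0,   1  ],
]
v = [1/5] * 5
for _ in range(200):
    v = [sum(L[i][j] * v[i] for i in range(5)) for j in range(5)]
print([round(p, 3) for p in v])   # node 5 holds (almost) all the mass
```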
Exercise 22
Let V1 and V2 be two finite sets. Then the set of edges in a complete directed bipartite graph having V1 as source
nodes and V2 as destination nodes is the Cartesian product of the two sets:
V1 × V2 = {(i, j) : i ∈ V1 ∧ j ∈ V2 }
Consider the graph G = (V, E) with

V = {1, . . . , 12},
E = {1, 2, 3} × {4, 5, 6} ∪ {5, 6} × {7, 8} ∪ {9, 10} × {11, 12}.
Note that the three components are not disjoint, but the graph is not connected.
For every node n, according to the HITS scoring system, let h(n) be its hub score and let a(n) be its authority
score. Moreover, if B = (VB, EB) is a bipartite graph, its importance I(B) is defined as the sum of the hub scores of its
source nodes plus the sum of the authority scores of its destination nodes:

I(B) = Σ_{i : ∃j (i,j) ∈ EB} h(i) + Σ_{i : ∃j (j,i) ∈ EB} a(i).
22.1) Which bipartite component (among G1, G2 and G3) will asymptotically achieve the maximum importance, and why?
22.2) Simulate three iterations of the HITS system starting with a uniform value of 1 to all hub and authority
scores. What is the importance of each bipartite component, at the end?
22.3) If the edge (3, 9) is added to G, how do you expect the importance scores of the three components to
change, and why?
22.4) With the further addition of edge (10, 3) to the graph, how do you expect the importance scores of the
three components to change, and why?
Solution — The initial graph is the union of the three bipartite components:

[Figure: G1 = {1, 2, 3} × {4, 5, 6}, G2 = {5, 6} × {7, 8}, G3 = {9, 10} × {11, 12}]
22.1) The HITS ranking system favors the largest bipartite component, which corresponds to the principal eigenvector
of E^T E. Therefore, component G1 will asymptotically prevail.
22.2) Authority scores:
Node 1 2 3 4 5 6 7 8 9 10 11 12
Initial value 1 1 1 1 1 1 1 1 1 1 1 1
Step 1 0 0 0 3 3 3 2 2 0 0 2 2
Step 2 0 0 0 9 9 9 4 4 0 0 4 4
Step 3 0 0 0 27 27 27 8 8 0 0 8 8
Hub scores:
Node 1 2 3 4 5 6 7 8 9 10 11 12
Initial value 1 1 1 1 1 1 1 1 1 1 1 1
Step 1 3 3 3 0 2 2 0 0 2 2 0 0
Step 2 9 9 9 0 4 4 0 0 4 4 0 0
Step 3 27 27 27 0 8 8 0 0 8 8 0 0
Importance scores:
Component G1 G2 G3
Initial value 6 4 4
Step 1 18 8 8
Step 2 54 16 16
Step 3 162 32 32
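The iterations above can be reproduced with a short script (simultaneous update of hubs and authorities from the previous step's values, with no normalization, as in the tables).

```python
# Edge set E of the exercise.
edges = ([(i, j) for i in (1, 2, 3) for j in (4, 5, 6)]
         + [(i, j) for i in (5, 6) for j in (7, 8)]
         + [(i, j) for i in (9, 10) for j in (11, 12)])
nodes = range(1, 13)
h = {n: 1 for n in nodes}   # initial hub scores
a = {n: 1 for n in nodes}   # initial authority scores
for _ in range(3):
    # a(n) sums the hubs pointing to n; h(n) sums the authorities n points to.
    new_a = {n: sum(h[i] for i, j in edges if j == n) for n in nodes}
    new_h = {n: sum(a[j] for i, j in edges if i == n) for n in nodes}
    h, a = new_h, new_a
print(a[4], h[1], a[7], h[5])   # → 27 27 8 8
```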
22.3) After the addition of edge (3, 9), the graph is the following:

[Figure: the three components with the new edge (3, 9) from G1 to G3]
Due to the current hub value of node 3, the authority value of node 9 increases, and in turn also the hub value of node 3 will
increase, and therefore the authority values of nodes 4, 5, and 6. Therefore, the new edge causes I(G1 ) to increase. On the
other hand, the authority value of node 9 does not impact on I(G3 ), where it is a source, and for the same reason the new
edge has no impact on I(G2 ).
22.4) Finally, after the addition of the last edge (10, 3):

[Figure: the graph with both added edges (3, 9) and (10, 3)]
The edge only affects the authority score of node 3 (a source in G1), therefore I(G1) and I(G2) do not change,
while the hub score of node 10, and hence the authority scores of nodes 11 and 12, increases. Therefore, the new
edge only impacts I(G3).
Exercise 23
A set of four web pages (A, B, C and D) is completely connected: all pages contain links to every other page,
while no page contains links to itself.
23.1) Compute the PageRank score of all pages.
23.2) Now add web page E, and two links: one from C to E, the second from E to D (so that E has exactly an
incoming link and an outgoing link). Compute the PageRank score of all pages.
Exercise 24
Consider a document corpus with m = 6 documents, n = 5 terms. Suppose that documents have been clustered
into m′ = 2 clusters and terms have been clustered into n′ = 2 clusters. The following document-term matrix
and cluster attribution has been determined:
  term:          1  2  3  4  5
  term cluster:  1  1  2  1  2

  document (cluster)   occurring terms (X = occurrence)
  1 (1)                X
  2 (2)                X X
  3 (1)                X X X
  4 (1)                X X
  5 (2)                X X
  6 (2)                X X
24.1) Consider the Jaccard index as similarity measure. Suppose that all we know about a document is that it
contains term 2. Which other term is most likely to occur in the same document?
24.2) Compute the following probabilities for all suitable index values:
• the probability pi′ that a random document belongs to cluster i′ ;
• the probability pj ′ that a random item belongs to cluster j ′ ;
• the probability pi′ j ′ that a document in cluster i′ contains a term in cluster j ′ .
24.3) Perform a step of the Gibbs Sampling technique on document 4 by computing the posterior probabilities
π4→i′ for i′ = 1, 2. Was the proposed cluster attribution likely, or will it probably be changed?
Exercise 25
Given the following three documents (each row is a document and each cell corresponds to a term and contains
its term id)
1 1 2 1 5 2 2
2 4 3 3 1 2 1
3 2 2 5 4 3 3
assume a multinomial model for the document generation and estimate the parameters of the term distribution by
using the maximum likelihood estimation method. (Show all the steps to obtain the best parameter estimation)
As all the documents have the same length, assume P (L = ld |Θ) = 1 in the multinomial model
P (ld , n(d, t)|Θ).
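The maximum likelihood estimate for a multinomial is θt = n(t) / Σ_t′ n(t′); it can be checked numerically as below. One assumption about the table layout is made here: the leading cell of each row is read as the document id and stripped off.

```python
from collections import Counter

# Rows of the exercise, with the leading document id removed.
docs = [[1, 2, 1, 5, 2, 2],
        [4, 3, 3, 1, 2, 1],
        [2, 2, 5, 4, 3, 3]]
counts = Counter(t for d in docs for t in d)
total = sum(counts.values())                     # 18 occurrences overall
theta = {t: counts[t] / total for t in sorted(counts)}
print(theta)   # theta_t = n(t) / sum_t' n(t'), e.g. theta_2 = 6/18 = 1/3
```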
Exercise 26
Solve the previous exercise by using the least squares method. (Show all the steps to obtain the best parameter
estimation)
Exercise 27
Given the following three documents (each row is a document and each cell corresponds to a term and contains
its term id)
1 1 2 1 5 2 2 3 2
2 1 3 1 5 2 2
3 2 2 5 4 3 3 2
assume a multinomial model for the document generation and estimate the parameters of the term distribution
by using the least squares method. (Show all the steps to obtain the best parameter estimation)
Exercise 28
A document set has been partitioned into two clusters. For each cluster, 100 2-shingles have been sampled
randomly. Shingles were divided into four categories: “2, 4” (term 2 followed by term 4), “2, 4̄” (term 2 followed
by any term different from 4), “2̄, 4” (any term different from 2 followed by term 4) and “2̄, 4̄”.
The following tables show the results of our sampling in clusters 1 and 2 respectively:
  C1:  4   4̄        C2:  4   4̄
  2    20  10       2    10  10
  2̄    10  60       2̄     0  80
28.1) Based on the above contingency tables, what are the relative frequencies of terms 2 and 4 within the two
clusters? What are the relative frequencies of terms different from 2 and 4?
28.2) Consider the following term-based generative model for documents:
1. Choose the cluster by an unbiased coin throw.
2. Document length is always 6.
3. Choose every term of the document independently with probability equal to the frequency of the term
within the chosen cluster.
(Hint: the model depends on four free parameters, i.e., probability of term 2 and probability of term 4 within
each cluster).
Use this model to determine the maximum-likelihood clustering of the three following documents:
• d1 = 1, 2, 4, 2, 3, 5
• d2 = 3, 2, 1, 3, 5, 4
• d3 = 1, 2, 4, 2, 4, 2
28.3) Use a similar generative model based on shingles instead of terms, considering every shingle as independent
of the others (so that every document is made of 5 independent shingles).
Use this model to determine the maximum-likelihood clustering of the same documents.
Nota bene: this is not an exercise about parameter estimation. Parameters are already given, only document
attribution to clusters must be decided.
Solution — 28.1) We just count the frequency of shingles containing term 2 in the first position in the two clusters
and divide by the total number of samples. In cluster 1, for example, 30 samples out of 100 contain term 2. We do
similarly for term 4 (second position), then compute the remaining probability:

           C1    C2
  term 2   0.3   0.2
  term 4   0.3   0.1
  other    0.4   0.7
28.2) We must compute the probability for each document to be generated within each cluster. Since every term is
generated independently, the probability of the document is just the probability of each term being selected in its position.
Let us define as pij the probability for document i to be generated in cluster j. For example:
p21 = .4 × .3 × .4 × .4 × .4 × .3 = .002304
Probabilities are:
C1 C2
d1 .001728 .001372
d2 .002304 .004802
d3 .000972 .000056
Therefore, documents d1 and d3 are attributed to cluster C1 , document d2 to cluster C2 .
28.3) Same computation with shingles:
C1 C2
d1 .00432 .00512
d2 .00216 0
d3 .00864 .00512
Notice that the probability that d2 is generated in cluster C2 is null because it contains the shingle “2̄, 4”. Therefore,
documents d2 and d3 are attributed to cluster C1 , while d1 is attributed to cluster C2 .
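The term-based likelihoods of 28.2 can be recomputed with a short script, using the four free parameters (plus residual probabilities) derived in 28.1.

```python
# Per-cluster term-emission probabilities from the contingency tables.
p = {"C1": {2: 0.3, 4: 0.3, "other": 0.4},
     "C2": {2: 0.2, 4: 0.1, "other": 0.7}}

def likelihood(doc, cluster):
    # Every term is generated independently, so the document probability
    # is the product of the per-term probabilities.
    prob = 1.0
    for t in doc:
        prob *= p[cluster][t if t in (2, 4) else "other"]
    return prob

docs = {"d1": [1, 2, 4, 2, 3, 5],
        "d2": [3, 2, 1, 3, 5, 4],
        "d3": [1, 2, 4, 2, 4, 2]}
for name, doc in docs.items():
    best = max(("C1", "C2"), key=lambda c: likelihood(doc, c))
    print(name, likelihood(doc, "C1"), likelihood(doc, "C2"), "->", best)
```

It reproduces the table above: d1 and d3 go to C1, d2 to C2.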
Exercise 29
As a part of a clustering method, we decide to compute for each document d the number of occurrences of the
most frequent term in that document:
N(d) = max_{t ∈ T} n(t, d).
We want to model N (d) as a Poisson random variable with parameter λ, so that given a random document d in
our corpus we have
Pr(N(d) = n; λ) = λ^n e^−λ / n!
29.1) Given a sampled document set d1 , . . . , dk , show that the maximum likelihood estimate of λ in the Poisson
model is the average of N (di ).
29.2) Determine λ based on the following sample:
• d1 = (1, 2, 3, 4, 3, 2, 3, 3, 2, 1, 5, 6, 3)
• d2 = (3, 2, 4, 4, 2, 3, 2, 4, 5, 1, 6)
• d3 = (6, 4, 3, 5, 2, 6, 1, 7)
• d4 = (6, 5, 4, 3, 2, 1, 2, 6, 2, 2)
• d5 = (4, 3, 4, 2, 5, 4, 1, 6, 3)
Solution —
29.1) The likelihood of λ with respect to the sample set is
L(λ; N(d1), . . . , N(dk)) = Pr(N(d1), . . . , N(dk); λ) = ∏_{i=1}^{k} Pr(N(di); λ) = ∏_{i=1}^{k} λ^N(di) e^−λ / N(di)!
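For 29.2, the sample values N(di) and the resulting estimate can be computed directly; each row is read as a list of term ids, and N(d) is the count of the most frequent id.

```python
from collections import Counter

docs = [
    [1, 2, 3, 4, 3, 2, 3, 3, 2, 1, 5, 6, 3],
    [3, 2, 4, 4, 2, 3, 2, 4, 5, 1, 6],
    [6, 4, 3, 5, 2, 6, 1, 7],
    [6, 5, 4, 3, 2, 1, 2, 6, 2, 2],
    [4, 3, 4, 2, 5, 4, 1, 6, 3],
]
# N(d) = number of occurrences of the most frequent term in d.
N = [max(Counter(d).values()) for d in docs]
lam = sum(N) / len(N)       # maximum-likelihood estimate of lambda
print(N, lam)               # → [5, 3, 2, 4, 3] 3.4
```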