
Information Retrieval - Exercises

1. Consider the following documents collection


d1 = “Big cats are nice and funny”
d2 = “Small dogs are better than big dogs”
d3 = “Small cats are afraid of small dogs”
d4 = “Big cats are not afraid of small dogs”
d5 = “Funny cats are not afraid of small dogs”
(a) Compute the tokens for each document
(b) Normalize the tokens with respect to plurals and upper/lower case
(c) Compute the dictionary relative to the documents collection

Solution:
(a) Compute the tokens for each document
d1 = “Big—cats—are—nice—and—funny”
d2 = “Small—dogs—are—better—than—big—dogs”
d3 = “Small—cats—are—afraid—of—small—dogs”
d4 = “Big—cats—are—not—afraid—of—small—dogs”
d5 = “Funny—cats—are—not—afraid—of—small—dogs”
(b) Normalize the tokens with respect to plurals and upper/lower case
d1 = “big—cat—is—nice—and—funny”
d2 = “small—dog—is—better—than—big—dog”
d3 = “small—cat—is—afraid—of—small—dog”
d4 = “big—cat—is—not—afraid—of—small—dog”
d5 = “funny—cat—is—not—afraid—of—small—dog”
(c) Compute the dictionary relative to the documents collection
Dictionary = {big,cat,is,nice,and,funny,small,dog,better,than,afraid,of,not}
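The three steps can be sketched in Python. The normalization below is a deliberately naive rule (lowercase everything, map "are" to "is", strip a trailing "s") that only covers this toy collection; a real system would use a proper stemmer or lemmatizer.

```python
docs = {
    "d1": "Big cats are nice and funny",
    "d2": "Small dogs are better than big dogs",
    "d3": "Small cats are afraid of small dogs",
    "d4": "Big cats are not afraid of small dogs",
    "d5": "Funny cats are not afraid of small dogs",
}

def normalize(token):
    # lowercase; map "are" to "is"; strip a naive trailing plural "s"
    token = token.lower()
    if token == "are":
        return "is"
    if token.endswith("s") and len(token) > 2:
        return token[:-1]
    return token

tokens = {d: text.split() for d, text in docs.items()}
normalized = {d: [normalize(t) for t in ts] for d, ts in tokens.items()}
dictionary = sorted({t for ts in normalized.values() for t in ts})
```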

2. Starting from the documents collection of Exercise 1, build the documents-terms incidence matrix as
required by the Boolean model.

Solution:

      big  cat  is  nice  and  funny  small  dog  better  than  afraid  of  not
d1     1    1   1    1     1     1      0     0     0       0      0     0   0
d2     1    0   1    0     0     0      1     1     1       1      0     0   0
d3     0    1   1    0     0     0      1     1     0       0      1     1   0
d4     1    1   1    0     0     0      1     1     0       0      1     1   1
d5     0    1   1    0     0     1      1     1     0       0      1     1   1

3. Starting from the documents collection of Exercise 1 consider a Boolean model.


(a) Answer the query q1 = funny AND dog
(b) Answer the query q2 = nice OR dog
(c) Answer the query q3 = big AND dog AND NOT funny
(d) Translate query q3 into a Disjunctive Normal Form considering a
dictionary = {big,cat,funny,small,dog}

Solution:
(a) Answer the query q1 = funny AND dog
Rfunny = {d1 , d5 }
Rdog = {d2 , d3 , d4 , d5 }
q1 → Rfunny ∩ Rdog = {d5 }
(b) Answer the query q2 = nice OR dog
Rnice = {d1 }
Rdog = {d2 , d3 , d4 , d5 }
q2 → Rnice ∪ Rdog = {d1 , d2 , d3 , d4 , d5 }
(c) Answer the query q3 = big AND dog AND NOT funny
Rbig = {d1 , d2 , d4 } Rdog = {d2 , d3 , d4 , d5 } Rfunny = {d1 , d5 }

q3 → (Rbig ∩ Rdog ) ∩ Rfunny^C = {d2 , d4 } ∩ {d2 , d3 , d4 } = {d2 , d4 }
(d) Translate query q3 into a Disjunctive Normal Form considering a
dictionary = {big,cat,funny,small,dog}
big  cat  funny  small  dog
 1    0     0      0     1
 1    0     0      1     1
 1    1     0      0     1
 1    1     0      1     1

q3 = (big ∧ ¬cat ∧ ¬funny ∧ ¬small ∧ dog) ∨
     (big ∧ ¬cat ∧ ¬funny ∧ small ∧ dog) ∨
     (big ∧ cat ∧ ¬funny ∧ ¬small ∧ dog) ∨
     (big ∧ cat ∧ ¬funny ∧ small ∧ dog)
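The Boolean answers can be double-checked with Python sets, with the postings transcribed from the incidence matrix of Exercise 2:

```python
# Document-id sets per term, read off the incidence matrix.
postings = {
    "big":   {1, 2, 4},
    "nice":  {1},
    "funny": {1, 5},
    "dog":   {2, 3, 4, 5},
}

q1 = postings["funny"] & postings["dog"]                      # funny AND dog
q2 = postings["nice"] | postings["dog"]                       # nice OR dog
q3 = (postings["big"] & postings["dog"]) - postings["funny"]  # big AND dog AND NOT funny
```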

4. Starting from the documents collection of Exercise 1, build the documents-terms weights matrix using
as term-frequency:
(a) the number of occurrences of the term in each document
(b) the normalized number of occurrences
(c) the logarithmic number of occurrences

Solution:
Compute the inverse document frequency for each term
ti idfi
big log2 (5/3)
cat log2 (5/4)
is log2 (5/5)
nice log2 (5/1)
and log2 (5/1)
funny log2 (5/2)
small log2 (5/4)
dog log2 (5/4)
better log2 (5/1)
than log2 (5/1)
afraid log2 (5/3)
of log2 (5/3)
not log2 (5/2)
Since the term “is” appears in all the documents we can safely ignore it from here on.
Compute the number of occurrences of each term in each document
      big  cat  nice  and  funny  small  dog  better  than  afraid  of  not
d1     1    1    1     1     1      0     0     0       0      0     0   0
d2     1    0    0     0     0      1     2     1       1      0     0   0
d3     0    1    0     0     0      2     1     0       0      1     1   0
d4     1    1    0     0     0      1     1     0       0      1     1   1
d5     0    1    0     0     1      1     1     0       0      1     1   1
Remember that wi,j = tfi,j · idfi
Compute the documents-terms weights matrix using as term-frequency:
(a) the number of occurrences of the term in each document [tfi,j = freqi,j ]
      big          cat          nice         and          funny        small
d1    log2 (5/3)   log2 (5/4)   log2 (5/1)   log2 (5/1)   log2 (5/2)   0
d2    log2 (5/3)   0            0            0            0            log2 (5/4)
d3    0            log2 (5/4)   0            0            0            2 log2 (5/4)
d4    log2 (5/3)   log2 (5/4)   0            0            0            log2 (5/4)
d5    0            log2 (5/4)   0            0            log2 (5/2)   log2 (5/4)

      dog            better       than         afraid       of           not
d1    0              0            0            0            0            0
d2    2 log2 (5/4)   log2 (5/1)   log2 (5/1)   0            0            0
d3    log2 (5/4)     0            0            log2 (5/3)   log2 (5/3)   0
d4    log2 (5/4)     0            0            log2 (5/3)   log2 (5/3)   log2 (5/2)
d5    log2 (5/4)     0            0            log2 (5/3)   log2 (5/3)   log2 (5/2)
(b) the normalized number of occurrences [tfi,j = freqi,j / maxi freqi,j ]
For each document compute the maximum number of occurrences among all the terms
dj maxi f reqi,j
d1 1
d2 2
d3 2
d4 1
d5 1
Compute the documents-terms weights matrix

      big              cat              nice         and          funny        small
d1    log2 (5/3)       log2 (5/4)       log2 (5/1)   log2 (5/1)   log2 (5/2)   0
d2    0.5 log2 (5/3)   0                0            0            0            0.5 log2 (5/4)
d3    0                0.5 log2 (5/4)   0            0            0            log2 (5/4)
d4    log2 (5/3)       log2 (5/4)       0            0            0            log2 (5/4)
d5    0                log2 (5/4)       0            0            log2 (5/2)   log2 (5/4)
      dog              better           than             afraid           of               not
d1    0                0                0                0                0                0
d2    log2 (5/4)       0.5 log2 (5/1)   0.5 log2 (5/1)   0                0                0
d3    0.5 log2 (5/4)   0                0                0.5 log2 (5/3)   0.5 log2 (5/3)   0
d4    log2 (5/4)       0                0                log2 (5/3)       log2 (5/3)       log2 (5/2)
d5    log2 (5/4)       0                0                log2 (5/3)       log2 (5/3)       log2 (5/2)
(c) the logarithmic number of occurrences [tfi,j = 1 + log2 freqi,j ]
Please note that
1 + log2 1 = 1 + 0 = 1
1 + log2 2 = 1 + 1 = 2
      big          cat          nice         and          funny        small
d1    log2 (5/3)   log2 (5/4)   log2 (5/1)   log2 (5/1)   log2 (5/2)   0
d2    log2 (5/3)   0            0            0            0            log2 (5/4)
d3    0            log2 (5/4)   0            0            0            2 log2 (5/4)
d4    log2 (5/3)   log2 (5/4)   0            0            0            log2 (5/4)
d5    0            log2 (5/4)   0            0            log2 (5/2)   log2 (5/4)

      dog            better       than         afraid       of           not
d1    0              0            0            0            0            0
d2    2 log2 (5/4)   log2 (5/1)   log2 (5/1)   0            0            0
d3    log2 (5/4)     0            0            log2 (5/3)   log2 (5/3)   0
d4    log2 (5/4)     0            0            log2 (5/3)   log2 (5/3)   log2 (5/2)
d5    log2 (5/4)     0            0            log2 (5/3)   log2 (5/3)   log2 (5/2)
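All three weighting schemes can be reproduced with a short Python sketch (logarithms in base 2, matching the idf table above):

```python
import math

docs = {
    1: "big cat is nice and funny".split(),
    2: "small dog is better than big dog".split(),
    3: "small cat is afraid of small dog".split(),
    4: "big cat is not afraid of small dog".split(),
    5: "funny cat is not afraid of small dog".split(),
}
N = len(docs)
terms = sorted({t for ws in docs.values() for t in ws})

df = {t: sum(1 for ws in docs.values() if t in ws) for t in terms}
idf = {t: math.log2(N / df[t]) for t in terms}

# (a) raw term frequency: tf = freq
w_raw = {d: {t: ws.count(t) * idf[t] for t in terms} for d, ws in docs.items()}

# (b) term frequency normalized by the document's maximum frequency
w_norm = {}
for d, ws in docs.items():
    mx = max(ws.count(t) for t in terms)
    w_norm[d] = {t: ws.count(t) / mx * idf[t] for t in terms}

# (c) logarithmic term frequency: tf = 1 + log2(freq) for freq > 0
w_log = {d: {t: (1 + math.log2(ws.count(t))) * idf[t] if t in ws else 0.0
             for t in terms}
         for d, ws in docs.items()}
```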

5. Starting from the documents collection of Exercise 1, rank the documents with respect to query q =
{big—cat—afraid} using the normalized term-frequency model (Exercise 4b).
(a) Use the Euclidean distance
(b) Use Cosine similarity
(c) Use Jaccard similarity

Solution:
Recall the normalized term-document weights matrix
      big    cat    nice   and    funny  small  dog    better than   afraid of     not
d1    0.74   0.32   2.32   2.32   1.32   0      0      0      0      0      0      0
d2    0.37   0      0      0      0      0.16   0.32   1.16   1.16   0      0      0
d3    0      0.16   0      0      0      0.32   0.16   0      0      0.37   0.37   0
d4    0.74   0.32   0      0      0      0.32   0.32   0      0      0.74   0.74   1.32
d5    0      0.32   0      0      1.32   0.32   0.32   0      0      0.74   0.74   1.32
Compute the query vector
      big    cat    nice   and    funny  small  dog    better than   afraid of     not
q     0.74   0.32   0      0      0      0      0      0      0      0.74   0      0


(a) Use the Euclidean distance
Compute the Euclidean distance and the Similarity Coefficient between each document and
the query. (Note: at the exam choose only one of the similarity coefficient definitions)
SC1 (q, dj ) = e^(−‖q − dj‖2)
SC2 (q, dj ) = 1 / (1 + ‖q − dj‖2)

||dj − q||2 SC1 (q, dj ) SC2 (q, dj )


d1 3.61 0.03 0.22
d2 1.90 0.15 0.34
d3 0.99 0.37 0.50
d4 1.58 0.21 0.39
d5 2.19 0.11 0.31
Using either SC1 or SC2 the final ranking is: d3 > d4 > d2 > d5 > d1
(b) Use Cosine similarity

SC(q, dj )
d1 0.16
d2 0.14
d3 0.45
d4 0.57
d5 0.27
Ranking: d4 > d3 > d5 > d1 > d2
(c) Use Jaccard similarity
SC(q, dj )
d1 0.05
d2 0.07
d3 0.25
d4 0.32
d5 0.12
Ranking: d4 > d3 > d5 > d2 > d1
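The three measures can be verified numerically. The vectors below list only the nonzero entries of q, d3 and d4 from the recalled weight matrix; the Jaccard variant used is the generalized (weighted) form SC = q·d / (|q|² + |d|² − q·d), which reproduces the values above.

```python
import math

# Nonzero weights of the query and of d3, d4 from the recalled matrix.
q  = {"big": 0.74, "cat": 0.32, "afraid": 0.74}
d3 = {"cat": 0.16, "small": 0.32, "dog": 0.16, "afraid": 0.37, "of": 0.37}
d4 = {"big": 0.74, "cat": 0.32, "small": 0.32, "dog": 0.32,
      "afraid": 0.74, "of": 0.74, "not": 1.32}

def dot(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def norm(u):
    return math.sqrt(sum(w * w for w in u.values()))

def euclidean(u, v):
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in keys))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def jaccard(u, v):
    # generalized (weighted) Jaccard: u.v / (|u|^2 + |v|^2 - u.v)
    d = dot(u, v)
    return d / (norm(u) ** 2 + norm(v) ** 2 - d)
```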

6. Starting from the documents collection of Exercise 1, rank the documents with respect to query q =
{afraid—cat—funny} using the probabilistic model under the binary independence assumption.
Initialize the probability of term ti of appearing in a document relevant to the query (pi ) with even
odds.
Initialize the probability of term ti of appearing in a document not relevant to the query (ui ) assuming
that all documents are non relevant.
In the following iterations assume that the top-2 documents are relevant. In case of ties order the
documents (di ) by increasing index i.

Solution:
1. Consider only the terms appearing in the query
d1 = “cat—funny”
d2 = “”
d3 = “cat—afraid”
d4 = “cat—afraid”
d5 = “funny—cat—afraid”
Code the terms as t1 = “afraid”, t2 = “cat” and t3 = “funny”. N = 5 documents in the
collection.
Initialize the probability of term ti of appearing in documents non relevant to the query.
ui = ni /N for each term ti , where ni is the number of documents containing term ti .
u1 = 3/5 = 0.6
u2 = 4/5 = 0.8
u3 = 2/5 = 0.4
Initialize the probability of term ti of appearing in documents relevant to the query.
p1 = 0.5
p2 = 0.5
p3 = 0.5

2. 1st iteration. Compute the Similarity Coefficient between each document and the query
SC(d1 , q) = log2 (p2 /(1 − p2 )) + log2 ((1 − u2 )/u2 ) + log2 (p3 /(1 − p3 )) + log2 ((1 − u3 )/u3 ) = −1.42
SC(d2 , q) = −∞ (no query terms)
SC(d3 , q) = log2 (p1 /(1 − p1 )) + log2 ((1 − u1 )/u1 ) + log2 (p2 /(1 − p2 )) + log2 ((1 − u2 )/u2 ) = −2.58
SC(d4 , q) = log2 (p1 /(1 − p1 )) + log2 ((1 − u1 )/u1 ) + log2 (p2 /(1 − p2 )) + log2 ((1 − u2 )/u2 ) = −2.58
SC(d5 , q) = log2 (p1 /(1 − p1 )) + log2 ((1 − u1 )/u1 ) + log2 (p2 /(1 − p2 )) + log2 ((1 − u2 )/u2 ) + log2 (p3 /(1 − p3 )) + log2 ((1 − u3 )/u3 ) = −2

Rank the documents and set as relevant the top-2 documents.


d1 > d 5 > d 3 > d 4 > d 2

Compute pi and ui for each term ti .

pi = si /S, where si is the number of relevant documents containing term ti and S is the total number of relevant documents.
p1 = 1/2 = 0.5
p2 = 2/2 = 1
p3 = 2/2 = 1

ui = (ni − si )/(N − S)
u1 = (3 − 1)/3 = 0.67
u2 = (4 − 2)/3 = 0.67
u3 = (2 − 2)/3 = 0

3. 2nd iteration. Compute the Similarity Coefficient between each document and the query, writing the divergent terms as limits (ε → 0):
SC(d1 , q) = 3 · lim(ε→0) log2 ((1 − ε)/ε)
SC(d2 , q) = −∞ (no query terms)
SC(d3 , q) = lim(ε→0) log2 ((1 − ε)/ε)
SC(d4 , q) = lim(ε→0) log2 ((1 − ε)/ε)
SC(d5 , q) = 3 · lim(ε→0) log2 ((1 − ε)/ε)


Rank the documents and set as relevant the top-2 documents.


d1 > d 5 > d 3 > d 4 > d 2
Convergence reached.

7. Consider a documents collection made of 100 documents.


Given a query q, the set of documents relevant to the users is D∗ = {d3 , d12 , d34 , d56 , d98 }. An IR system
retrieves the following documents D = {d3 , d12 , d35 , d56 , d66 , d88 , d95 }
(a) Compute the number of True-Positives, True-Negatives, False-Positives, False-Negatives
(b) Compute Precision, Recall, Balanced F-measure, Accuracy

Solution:
(a) Compute the number of True-Positives, True-Negatives, False-Positives, False-Negatives
TP = 3
FP = 4

FN = 2
TN = 91
(b) Compute Precision, Recall, Balanced F-measure, Accuracy
P = 3/7
R = 3/5
F = 1/2
A = 94/100
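The counts and measures follow directly from set operations; a sketch:

```python
relevant  = {"d3", "d12", "d34", "d56", "d98"}
retrieved = {"d3", "d12", "d35", "d56", "d66", "d88", "d95"}
N = 100  # collection size

tp = len(relevant & retrieved)
fp = len(retrieved - relevant)
fn = len(relevant - retrieved)
tn = N - tp - fp - fn

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)  # balanced F-measure
accuracy  = (tp + tn) / N
```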

8. An IR system produces the following rankings in answer to queries q1 and q2 . The underscored documents
are the ones relevant to the user.
R q1 q2
1 A F
2 L G
3 G D
4 F E
5 D L
6 E I
7 B H
8 H C
9 I B
10 C A
(a) Draw the precision-recall curve and the interpolated precision-recall curve
(b) Compute the Mean Average Precision
(c) Compute the R-precision
(d) Draw the Receiver-Operating-Characteristic

Solution:
(a) Draw the precision-recall curve and the interpolated precision-recall curve
Retrieved documents Pq1 Rq1 Pq2 Rq2
1 1/1 1/5 1/1 1/4
2 1/2 1/5 2/2 2/4
3 2/3 2/5 2/3 2/4
4 2/4 2/5 3/4 3/4
5 2/5 2/5 3/5 3/4
6 3/6 3/5 3/6 3/4
7 4/7 4/5 3/7 3/4
8 5/8 5/5 3/8 3/4
9 5/9 5/5 4/9 4/4
10 5/10 5/5 4/10 4/4

[Figure: precision-recall curve and interpolated precision-recall curve for q1 (Precision vs. Recall)]

[Figure: precision-recall curve and interpolated precision-recall curve for q2 (Precision vs. Recall)]

(b) Compute the Mean Average Precision

APq1 = (1/1 + 2/3 + 3/6 + 4/7 + 5/8) / 5 = 0.67
APq2 = (1/1 + 2/2 + 3/4 + 4/9) / 4 = 0.80
MAP = (0.67 + 0.80) / 2 = 0.74
(c) Compute the R-precision

Rpq1 = 2/5

Rpq2 = 3/4

(d) Draw the Receiver-Operating-Characteristic
Retrieved documents F P rq1 T P rq1 F P rq2 T P r q2
1 0/5 1/5 0/6 1/4
2 1/5 1/5 0/6 2/4
3 1/5 2/5 1/6 2/4
4 2/5 2/5 1/6 3/4
5 3/5 2/5 2/6 3/4
6 3/5 3/5 3/6 3/4
7 3/5 4/5 4/6 3/4
8 3/5 5/5 5/6 3/4
9 4/5 5/5 5/6 4/4
10 5/5 5/5 6/6 4/4
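Average precision can be recomputed from per-rank relevance flags (read off the underscored documents); a sketch:

```python
def average_precision(rels, total_relevant):
    # rels[i] is 1 if the document at rank i+1 is relevant to the user
    hits, s = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            s += hits / rank
    return s / total_relevant

rel_q1 = [1, 0, 1, 0, 0, 1, 1, 1, 0, 0]   # relevant: A, G, E, B, H
rel_q2 = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]   # relevant: F, G, E, B
ap1 = average_precision(rel_q1, 5)
ap2 = average_precision(rel_q2, 4)
map_score = (ap1 + ap2) / 2
```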

[Figure: ROC curve for q1 (True-Positive rate vs. False-Positive rate)]

[Figure: ROC curve for q2 (True-Positive rate vs. False-Positive rate)]

9. Starting from the documents collection of Exercise 1, build an inverted index for the documents collec-
tion.

Solution:
term coll. freq. postings list (freq)
afraid 3 3(1) → 4(1) → 5(1)
and 1 1(1)
better 1 2(1)
big 3 1(1) → 2(1) → 4(1)
cat 4 1(1) → 3(1) → 4(1) → 5(1)
dog 5 2(2) → 3(1) → 4(1) → 5(1)
funny 2 1(1) → 5(1)
is 5 1(1) → 2(1) → 3(1) → 4(1) → 5(1)
nice 1 1(1)
not 2 4(1) → 5(1)
of 3 3(1) → 4(1) → 5(1)
small 5 2(1) → 3(2) → 4(1) → 5(1)
than 1 2(1)
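The same inverted index (postings with within-document frequencies) can be built in a few lines of Python:

```python
from collections import defaultdict

docs = {
    1: "big cat is nice and funny".split(),
    2: "small dog is better than big dog".split(),
    3: "small cat is afraid of small dog".split(),
    4: "big cat is not afraid of small dog".split(),
    5: "funny cat is not afraid of small dog".split(),
}

index = defaultdict(dict)   # term -> {doc_id: within-document frequency}
for doc_id in sorted(docs):
    for term in docs[doc_id]:
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

# collection frequency = total occurrences of the term across all documents
coll_freq = {t: sum(p.values()) for t, p in index.items()}
```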

10. Given the following rankings


r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6) A (0.8) C (0.7)
D (0.5) D (0.7) D (0.6)
C (0.4) C (0.6) A (0.5)
(a) Compute the Borda’s winner
(b) Compute the Condorcet’s winner
(c) Compute the top-2 documents using the MedRank algorithm
(d) Compute the top-2 documents using the Fagin’s algorithm
(e) Compute the top-2 documents using the Fagin’s threshold algorithm

Solution:
(a) Compute the Borda’s winner
A = 1 · 1 + 2 · 1 + 4 · 1 = 7
B = 1 · 2 + 2 · 1 = 4
C = 2 · 1 + 4 · 2 = 10
D = 3 · 3 = 9
Borda’s winner is B.
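Borda's method here sums the 1-based rank positions across the three rankings and picks the smallest total; a sketch:

```python
rankings = [
    ["A", "B", "D", "C"],   # r1
    ["B", "A", "D", "C"],   # r2
    ["B", "C", "D", "A"],   # r3
]

def borda_scores(rankings):
    # sum of 1-based positions across rankings; the smallest total wins
    scores = {}
    for r in rankings:
        for pos, doc in enumerate(r, start=1):
            scores[doc] = scores.get(doc, 0) + pos
    return scores

scores = borda_scores(rankings)
winner = min(scores, key=scores.get)
```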
(b) Compute the Condorcet’s winner
B wins on A
A wins on C
A wins on D
B wins on C
B wins on D
D wins on C

[Figure: Condorcet tournament graph over {A, B, C, D}: B → A, B → C, B → D, A → C, A → D, D → C]

Condorcet’s winner is B.
(c) Compute the top-2 documents using the MedRank algorithm
1. Sequential access
r1 r2 r3
A B B
B is selected as 1st document
2. Sequential access
r1 r2 r3
A B B
B A C
A is selected as 2nd document
top-2 documents = (B, A)
(d) Compute the top-2 documents using the Fagin’s algorithm
1. Sequential access
r1 r2 r3
A (0.9) B (0.9) B (0.8)
2. Sequential access
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6) A (0.8) C (0.7)
B visible in all the rankings
3. Sequential access
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6) A (0.8) C (0.7)
D (0.5) D (0.7) D (0.6)
B, D visible in all the rankings. Compute the scores for all the extracted objects
{A, B, C, D} performing random access when needed.
A = 0.73 (r.a. to r3 )
B = 0.77
C = 0.57 (r.a. to r1 , r2 )
D = 0.60
top-2 documents = (B, A)
(e) Compute the top-2 documents using the Fagin’s threshold algorithm

1. Sequential access to r1
r1 r2 r3
A (0.9)
Compute the score of A by making random access to r2 , r3 .
A = 0.73
R = (A)
th = +∞
2. Sequential access to r2
r1 r2 r3
A (0.9) B (0.9)
Compute the score of B by making random access to r1 , r3 .
B = 0.77
R = (B, A)
th = +∞
3. Sequential access to r3
r1 r2 r3
A (0.9) B (0.9) B (0.8)
Score of B already computed.
R = (B, A)
th = +∞
4. Compute the threshold.
th = 0.87
5. Sequential access to r1
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6)
Score of B already computed.
R = (B, A)
th = 0.87
6. Sequential access to r2
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6) A (0.8)
Score of A already computed.
R = (B, A)
th = 0.87
7. Sequential access to r3
r1 r2 r3
A (0.9) B (0.9) B (0.8)
B (0.6) A (0.8) C (0.7)
Compute the score of C by making random access to r1 , r2 .
C = 0.57

R = (B, A)
th = 0.87
8. Compute the threshold.
th = 0.70
top-2 documents = (B, A)

11. Given the following rankings


r1 r2 r3
A B B
B A C
C C A
(a) Compute the optimal aggregation using the Kendall-Tau distance
(b) Compute the optimal aggregation using the Spearman’s footrule distance
(c) Compute the footrule aggregation using the median rank approximation

Solution:
(a) Compute the optimal aggregation using the Kendall-Tau distance
                  K(r1 , rci )   K(r2 , rci )   K(r3 , rci )   Σj K(rj , rci )
rc1 = A B C            0              1              2                3
rc2 = A C B            1              2              3                6
rc3 = B A C            1              0              1                2
rc4 = B C A            2              1              0                3
rc5 = C A B            2              3              2                7
rc6 = C B A            3              2              1                6
rc3 = (B, A, C) is the optimal aggregation under the Kendall-Tau distance
(b) Compute the optimal aggregation using the Spearman’s footrule distance
                  F (r1 , rci )   F (r2 , rci )   F (r3 , rci )   Σj F (rj , rci )
rc1 = A B C            0               2               4                 6
rc2 = A C B            2               4               4                10
rc3 = B A C            2               0               2                 4
rc4 = B C A            4               2               0                 6
rc5 = C A B            4               4               4                12
rc6 = C B A            4               4               2                10
rc3 = (B, A, C) is the optimal aggregation under the Spearman’s footrule distance
(c) Compute the footrule aggregation using the median rank approximation
µ0 (A) = median{1, 2, 3} = 2
µ0 (B) = median{2, 1, 1} = 1
µ0 (C) = median{3, 3, 2} = 3
(B, A, C) is the footrule aggregation using the median rank approximation.
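With only three items, both optimal aggregations can be found by brute force over all 3! candidate rankings; a sketch:

```python
from itertools import permutations

rankings = [["A", "B", "C"], ["B", "A", "C"], ["B", "C", "A"]]
items = ["A", "B", "C"]

def kendall_tau(r1, r2):
    # number of item pairs ordered differently by the two rankings
    p1 = {d: i for i, d in enumerate(r1)}
    p2 = {d: i for i, d in enumerate(r2)}
    return sum(
        1
        for i in range(len(items))
        for j in range(i + 1, len(items))
        if (p1[items[i]] - p1[items[j]]) * (p2[items[i]] - p2[items[j]]) < 0
    )

def footrule(r1, r2):
    # Spearman's footrule: sum of absolute rank displacements
    p1 = {d: i for i, d in enumerate(r1)}
    p2 = {d: i for i, d in enumerate(r2)}
    return sum(abs(p1[d] - p2[d]) for d in p1)

best_kt = min(permutations(items),
              key=lambda c: sum(kendall_tau(c, r) for r in rankings))
best_fr = min(permutations(items),
              key=lambda c: sum(footrule(c, r) for r in rankings))
```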

Homework 3
Exercise 18.1; Exercise 18.2; Exercise 18.5; Exercise 18.8; Exercise 18.11
Exercise 21.6; Exercise 21.10; Exercise 21.11
Exercise 13.2; Exercise 13.9
Exercise 14.2; Exercise 14.6
Exercise 15.2
Exercise 16.3; Exercise 16.13; Exercise 16.17; Exercise 16.20
Exercise 18.1 (0.5’)
What is the rank of the 3 × 3 diagonal matrix below?

1 1 0
0 1 1
1 2 1
Solution:
By applying Gauss elimination, we get:

[ 1 1 0 ]    [ 1 1 0 ]    [ 1 0 −1 ]
[ 0 1 1 ] →  [ 0 1 1 ] →  [ 0 1  1 ]
[ 1 2 1 ]    [ 0 1 1 ]    [ 0 0  0 ]

Hence, the rank of this matrix is 2.

Exercise 18.2 (0.5’)


Show that λ = 2 is an eigenvalue of

C = [ 6  −2 ]
    [ 4   0 ]

Find the corresponding eigenvector.
Solution:
If λ = 2, then det(C − 2I) = det([4, −2; 4, −2]) = 4 · (−2) − (−2) · 4 = 0. Hence, λ = 2 is an eigenvalue of C.
Suppose the corresponding eigenvector is x = (x1 , x2 )ᵀ. From (C − 2I)x = 0 we get 4x1 − 2x2 = 0, i.e. x2 = 2x1 .
Hence, any nonzero multiple of (1, 2)ᵀ is a corresponding eigenvector.

Exercise 18.5 (0.5’)


Verify that the SVD of the matrix C in Equation (18.12),

C = [  1  −1 ]
    [  0   1 ]
    [  1   0 ]

is

U = [  0.816  0.000 ]     Σ = [ 1.732  0.000 ]     Vᵀ = [ 0.707  −0.707 ]
    [ −0.408  0.707 ]         [ 0.000  1.000 ]          [ 0.707   0.707 ]
    [  0.408  0.707 ]

by verifying all of the properties in the statement of Theorem 18.3.
Solution:

CᵀC = [  1  0  1 ] · C = [  2  −1 ]
      [ −1  1  0 ]       [ −1   2 ]

det(CᵀC − λI) = λ² − 4λ + 3 = 0  ⟹  λ = 3, 1

CCᵀ = [  2  −1   1 ]
      [ −1   1   0 ]
      [  1   0   1 ]

det(CCᵀ − λI) = −λ³ + 4λ² − 3λ = 0  ⟹  λ = 3, 1, 0

It turns out that the two largest eigenvalues of CCᵀ are the same as those of CᵀC, and the singular values are their square roots: Σ = diag(√3, √1) = diag(1.732, 1.000). The columns of U and V are corresponding unit eigenvectors of CCᵀ and CᵀC, and multiplying out confirms U Σ Vᵀ = C.

Exercise 18.8 (0.5’)


Compute a rank 1 approximation C1 to the matrix C in Equation (18.12), using the SVD as in Equation (18.13). What is the Frobenius norm of the error of this approximation?
Solution:

Σ = [ 1.732  0.000 ] ,   Σ1 = [ 1.732  0.000 ]
    [ 0.000  1.000 ]          [ 0.000  0.000 ]

C1 = U Σ1 Vᵀ = [  0.816  0.000 ] [ 1.732  0.000 ] [ 0.707  −0.707 ]
               [ −0.408  0.707 ] [ 0.000  0.000 ] [ 0.707   0.707 ]
               [  0.408  0.707 ]

   = [  0.9992  −0.9992 ]   [  1    −1  ]
     [ −0.4996   0.4996 ] ≈ [ −0.5   0.5 ]
     [  0.4996  −0.4996 ]   [  0.5  −0.5 ]

X = C − C1 = [ 0    0   ]
             [ 0.5  0.5 ]
             [ 0.5  0.5 ]

Frobenius norm = √(0.5² + 0.5² + 0.5² + 0.5²) = 1

Exercise 18.11 (1’)


Assume you have a set of documents each of which is in either English or in Spanish.
The collection is given in Figure 18.4.
Figure 18.5 gives a glossary relating the Spanish and English words above for your
own information. This glossary is NOT available to the retrieval system:
1. Construct the appropriate term-document matrix C to use for a collection consisting
of these documents. For simplicity, use raw term frequencies rather than
normalized tf-idf weights. Make sure to clearly label the dimensions of your matrix.
2. Write down the matrices U2, Σ2 and V2 and from these derive the rank 2 approximation C2.
3. State succinctly what the (i, j) entry in the matrix CᵀC represents.
4. State succinctly what the (i, j) entry in the matrix C2ᵀC2 represents, and why it differs from that in CᵀC.
Solution:
1
Doc1 Doc2 Doc3 Doc4 Doc5 Doc6
hello 1 0 0 0 0 1
open 0 1 0 0 0 0
house 0 1 0 0 0 0
mi 0 0 1 0 0 0
casa 0 0 1 0 0 0
hola 0 0 0 1 1 0
Profesor 0 0 0 1 0 0
y 0 0 0 0 1 0
bienvenido 0 0 0 0 1 0
and 0 0 0 0 0 1
welcome 0 0 0 0 0 1
C =
1 0 0 0 0 1
0 1 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 1 0 0 0
0 0 0 1 1 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 1 0
0 0 0 0 0 1
0 0 0 0 0 1

C is an 11 × 6 matrix.

2
Σ2 =
1.9021 0
0 1.8478

U2 =
 0       0.7071
 0.0000  0
-0.0000  0
 0.0000  0
-0.0000  0
-0.7236  0
-0.2764  0
-0.4472  0
-0.4472  0
 0       0.5000
 0       0.5000

V2 =

0 0.3827
0 0
0 0
-0.5257 0
-0.8507 0
0 0.9239

C2 =

0.5000 0 0 0 0 1.2071
0 0 0 -0.0000 -0.0000 0
0 0 0 0.0000 0.0000 0
0 0 0 -0.0000 -0.0000 0
0 0 0 0.0000 0.0000 0
0 0 0 0.7236 1.1708 0
0 0 0 0.2764 0.4472 0
0 0 0 0.4472 0.7236 0
0 0 0 0.4472 0.7236 0
0.3536 0 0 0 0 0.8536
0.3536 0 0 0 0 0.8536

3. The (i, j) entry in the matrix CᵀC represents the number of terms occurring in both document i and document j.
4. The (i, j) entry in the matrix C2ᵀC2 represents the similarity between document i and document j in the low-dimensional space.

Exercise 21.6 (0.5’)


Consider a web graph with three nodes 1, 2 and 3. The links are as follows: 1 →

2, 3 → 2, 2 → 1, 2 → 3. Write down the transition probability matrices for the


surfer’s walk with teleporting, for the following three values of the teleport
probability: (a) a = 0; (b) a = 0.5 and (c) a = 1.
Solution:
(a)
0 1 0
1/2 0 1/2
0 1 0
(b)
1/6 2/3 1/6
5/12 1/6 5/12
1/6 2/3 1/6
(c)
1/3 1/3 1/3
1/3 1/3 1/3
1/3 1/3 1/3

Exercise 21.10 (0.5’)


Show that the PageRank of every page is at least a/N. What does this imply about
the difference in PageRank values (over the various pages) as a becomes close to 1?
Solution:
By the definition of PageRank with teleporting, in each step every page receives a teleport contribution of at least α/N, so its steady-state visit probability satisfies
x(i) ≥ α/N
So the PageRank of every page is at least α/N.
As α becomes closer to 1, the impact of the link structure of the web graph gets
smaller. Hence, the difference in PageRank values over various pages will get
smaller.

Exercise 21.11 (0.5’)


For the data in Example 21.1, write a small routine or use a scientific calculator to
compute the PageRank values stated in Equation (21.6).
Solution 1:
x*P = x, x = [0.05 0.04 0.11 0.25 0.21 0.035 0.31].

Solution 2:
The MATLAB code is as follows:

P = [0.02 0.02 0.88 0.02 0.02 0.02 0.02;
     0.02 0.45 0.45 0.02 0.02 0.02 0.02;
     0.31 0.02 0.31 0.31 0.02 0.02 0.02;
     0.02 0.02 0.02 0.45 0.45 0.02 0.02;
     0.02 0.02 0.02 0.02 0.02 0.02 0.88;
     0.02 0.02 0.02 0.02 0.02 0.45 0.45;
     0.02 0.02 0.02 0.31 0.31 0.02 0.31];
[W, D] = eig(P');
x = W(:,1)';
x = x / sum(x);

We can get x = [0.05 0.04 0.11 0.25 0.21 0.035 0.31].
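As an alternative to the eigenvector route, the stationary distribution can be approximated by power iteration in plain Python. The matrix entries are only printed to two decimals (row 3 actually sums to 1.01), so the result matches Equation (21.6) only approximately; the renormalization step below absorbs the rounding.

```python
# Transition matrix with teleporting, entries as printed above.
P = [
    [0.02, 0.02, 0.88, 0.02, 0.02, 0.02, 0.02],
    [0.02, 0.45, 0.45, 0.02, 0.02, 0.02, 0.02],
    [0.31, 0.02, 0.31, 0.31, 0.02, 0.02, 0.02],
    [0.02, 0.02, 0.02, 0.45, 0.45, 0.02, 0.02],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.88],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.45, 0.45],
    [0.02, 0.02, 0.02, 0.31, 0.31, 0.02, 0.31],
]

def power_iteration(P, iters=200):
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iters):
        # left-multiply: x_new[j] = sum_i x[i] * P[i][j]
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        s = sum(x)
        x = [v / s for v in x]  # renormalize (rows do not sum exactly to 1)
    return x

pr = power_iteration(P)
```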

Exercise 13.2 (0.5’)


Which of the documents in Table 13.5 have identical and different bag of words
representations for (i) the Bernoulli model (ii) the multinomial model? If there are
differences, describe them.
Solution:
(i) For the Bernoulli model, the 3 documents are identical.

(ii) For the multinomial model, documents 1 and 2 are identical and they are
different from document 3, because the term London occurs twice in
documents 1 and 2, but occurs once in document 3.

Exercise 13.9 (1’)


Based on the data in Table 13.10, (i) estimate a multinomial Naive Bayes classifier, (ii)
apply the classifier to the test document, (iii) estimate a Bernoulli NB classifier, (iv)
apply the classifier to the test document. You need not estimate parameters that you
don’t need for classifying the test document.

Solution:
(Multinomial model)
(i)
P̂(c) = 1/2
P̂(Taiwan|c) = (2 + 1)/(5 + 7) = 1/4
P̂(Sapporo|c) = (0 + 1)/(5 + 7) = 1/12
P̂(c̄) = 1/2
P̂(Taiwan|c̄) = (1 + 1)/(5 + 7) = 1/6
P̂(Sapporo|c̄) = (2 + 1)/(5 + 7) = 1/4
(ii)
P̂(c|d5) ∝ P̂(c) · P̂(Taiwan|c)² · P̂(Sapporo|c) = 1/2 · 1/4 · 1/4 · 1/12 ≈ 0.0026
P̂(c̄|d5) ∝ P̂(c̄) · P̂(Taiwan|c̄)² · P̂(Sapporo|c̄) = 1/2 · 1/6 · 1/6 · 1/4 ≈ 0.0035
So document 5 is not in Class China.

(Bernoulli model)
(iii)
P̂(c) = 1/2
P̂(Taiwan|c) = (2 + 1)/(2 + 2) = 3/4
P̂(Taipei|c) = P̂(Macao|c) = P̂(Shanghai|c) = (1 + 1)/(2 + 2) = 1/2
P̂(Japan|c) = P̂(Sapporo|c) = P̂(Osaka|c) = (0 + 1)/(2 + 2) = 1/4
P̂(c̄) = 1/2
P̂(Taiwan|c̄) = P̂(Japan|c̄) = P̂(Osaka|c̄) = (1 + 1)/(2 + 2) = 1/2
P̂(Sapporo|c̄) = (2 + 1)/(2 + 2) = 3/4
P̂(Taipei|c̄) = P̂(Macao|c̄) = P̂(Shanghai|c̄) = (0 + 1)/(2 + 2) = 1/4
(iv)
P̂(c|d5) ∝ P̂(c) · P̂(Taiwan|c) · P̂(Sapporo|c) · (1 − P̂(Taipei|c)) · (1 − P̂(Macao|c)) · (1 − P̂(Shanghai|c)) · (1 − P̂(Japan|c)) · (1 − P̂(Osaka|c))
= 1/2 · 3/4 · 1/4 · (1 − 1/2) · (1 − 1/2) · (1 − 1/2) · (1 − 1/4) · (1 − 1/4) ≈ 0.0066
P̂(c̄|d5) ∝ P̂(c̄) · P̂(Taiwan|c̄) · P̂(Sapporo|c̄) · (1 − P̂(Taipei|c̄)) · (1 − P̂(Macao|c̄)) · (1 − P̂(Shanghai|c̄)) · (1 − P̂(Japan|c̄)) · (1 − P̂(Osaka|c̄))
= 1/2 · 1/2 · 3/4 · (1 − 1/4) · (1 − 1/4) · (1 − 1/4) · (1 − 1/2) · (1 − 1/2) ≈ 0.0198
So document 5 is not in Class China.

Exercise 14.2 (0.5’)


Show that Rocchio classification can assign a label to a document that is different
from its training set label.
Solution:

[Figure: two classes in the plane, a large class A on the left and a smaller class B on the right; the point C lies inside A's region but closer to B's centroid]

Take the picture above as an example. There are 2 classes in the plane, with the left one being much bigger than the right one. Then a large part of the left circle will be misclassified, like the point C: C is a document belonging to class A in the training set, but it is closer to B's centroid than to A's, so Rocchio labels it as class B.

Exercise 14.6 (0.5’)


In Figure 14.14, which of the three vectors a, b, and c is (i) most similar to x
according to dot product similarity, (ii) most similar to x according to cosine
similarity, (iii) closest to x according to Euclidean distance?

Solution:
(i) ⟨a, x⟩ = 4, ⟨b, x⟩ = 16, ⟨c, x⟩ = 28.
So c is most similar to x according to the dot product.
(ii) cos(a, x) = ⟨a, x⟩/(|a||x|) = 0.8944, cos(b, x) = 1, cos(c, x) = 0.9899.
So b is most similar to x according to cosine similarity.
(iii) d(x, a) = 1.5811, d(x, b) = 2.8284, d(x, c) = 7.2111.
So a is closest to x according to Euclidean distance.

Exercise 15.2 (0.5’)


The basis of being able to use kernels in SVMs (see Section 15.2.3) is that the
classification function can be written in the form of Equation (15.9) (where, for large
problems, most α are 0). Show explicitly how the classification function could be
written in this form for the data set from Example 15.1. That is, write f as a function
where the data points appear and the only variable is x.
Solution:
Assume the three points: x1 = (1, 1) with y1 = −1, x2 = (2, 0) with y2 = −1, x3 = (2, 3) with y3 = +1.
The point (2, 0) is not a support vector, so its α2 = 0.
For the support vectors (1, 1) and (2, 3): α1 = α3 = 2/5, giving
w = Σi αi yi xi = 2/5 · (2, 3) − 2/5 · (1, 1) = (2/5, 4/5),   b = −11/5
f(x) = −2/5 · ⟨(1, 1), x⟩ + 2/5 · ⟨(2, 3), x⟩ − 11/5

Exercise 16.3 (1’)


Replace every point d in Figure 16.4 with two identical copies of d in the same class.
(i) Is it less difficult, equally difficult or more difficult to cluster this set of 34 points
as opposed to the 17 points in Figure 16.4? (ii) Compute purity, NMI, RI, and F5 for
the clustering with 34 points. Which measures increase and which stay the same after
doubling the number of points? (iii) Given your assessment in (i) and the results in
(ii), which measures are best suited to compare the quality of the two clusterings?
Solution:
(i) It is equally difficult.
(ii) purity = (1/34) · (10 + 8 + 6) = 0.71
NMI = 0.36
TP = 97, FP = 80, FN = 96, TN = 288
P = 0.55, R = 0.50
RI = 0.686
F5 = 0.5
Purity and NMI stay the same, while RI and F5 increase.
(iii) Purity and NMI

Exercise 16.13 (0.5’)


Prove that RSSmin(K) is monotonically decreasing in K.
Solution:
In an optimal clustering with K clusters (K smaller than the number of distinct vectors), take a cluster that contains at least two non-identical vectors. Splitting this cluster in two lowers RSS, hence RSSmin(K + 1) < RSSmin(K).

Exercise 16.17 (0.5’)


Perform a K-means clustering for the documents in Table 16.3. After how many
iterations does K-means converge? Compare the result with the EM clustering in
Table 16.3 and discuss the differences.
Solution:
After 2 iterations K-means converges.
The K-means clustering converges faster than EM. One possible reason is that the
data might satisfy the hard clustering condition rather than the soft clustering
condition, which can be seen from the final results of EM.

Exercise 16.20 (0.5’)


The within-point scatter of a clustering is defined as

Γ = (1/2) Σk (1/|ωk|) Σ_{x ∈ ωk} Σ_{x' ∈ ωk} |x − x'|²

Show that minimizing RSS and minimizing within-point scatter are equivalent.
Solution:
For a single cluster ωk with centroid µk = (1/|ωk|) Σ_{x ∈ ωk} x:

Σ_{x ∈ ωk} Σ_{x' ∈ ωk} |x − x'|²
= Σ_{x ∈ ωk} Σ_{x' ∈ ωk} (|x|² − 2⟨x, x'⟩ + |x'|²)
= 2|ωk| Σ_{x ∈ ωk} |x|² − 2⟨Σ_{x ∈ ωk} x, Σ_{x' ∈ ωk} x'⟩
= 2|ωk| Σ_{x ∈ ωk} |x|² − 2|ωk|² |µk|²
= 2|ωk| Σ_{x ∈ ωk} (|x|² − 2⟨x, µk⟩ + |µk|²)
= 2|ωk| Σ_{x ∈ ωk} |x − µk|²

(the next-to-last step uses Σ_{x ∈ ωk} ⟨x, µk⟩ = |ωk||µk|²). Hence

Γ = Σk (1/(2|ωk|)) · 2|ωk| Σ_{x ∈ ωk} |x − µk|² = Σk Σ_{x ∈ ωk} |x − µk|² = RSS

So minimizing RSS and minimizing within-point scatter are equivalent.


Information Retrieval and Text Mining, WS 2012/2013 Assignment 2 - Solutions

Assignment 2 - Solutions

Exercise 1 (IIR 13) [2 P.]


Based on the data below, estimate a multinomial Naive Bayes classifier (the type of NB classifier we
introduced in class) and apply the classifier to the test document. Calculate the probability that the
classifier assigns the test document to c = Japan or to its complement c̄.

docID words in document in c = Japan?


training set 1 Kyoto Tokyo Taiwan yes
2 Japan Kyoto yes
3 Taipei Taiwan no
4 Macao Taiwan Beijing no
test set 5 Taiwan Taiwan Kyoto ?

Solution
We can compute the probability of a document d being in a class c with the following formula:

P(c|d) ∝ P̂(c) · Π_{1≤k≤nd} P̂(tk|c)    (1)

Thus, for d5 and c = Japan, we need ...

i) The prior probabilities P̂(c) and P̂(c̄):

P̂(c) = Nc /N = 2/4 = 1/2
P̂(c̄) = Nc̄ /N = 2/4 = 1/2

ii) The conditional probabilities P̂(Taiwan|c), P̂(Taiwan|c̄), P̂(Kyoto|c) and P̂(Kyoto|c̄).


We compute them by means of the following formula:

P̂(t|c) = (C(t, c) + 1) / (Σ_{t'∈V} C(t', c) + |V|)

The vocabulary has 7 terms: Kyoto, Tokyo, Taiwan, Japan, Taipei, Macao, Beijing. There are 5 tokens in the concatenation of all c documents, and 5 tokens in the concatenation of all c̄ documents. Thus, the denominators have the form (5 + 7). The conditional probabilities for both classes are then as follows:

P̂(Taiwan|c) = (1 + 1)/(5 + 7) = 2/12
P̂(Taiwan|c̄) = (2 + 1)/(5 + 7) = 3/12
P̂(Kyoto|c) = (2 + 1)/(5 + 7) = 3/12
P̂(Kyoto|c̄) = (0 + 1)/(5 + 7) = 1/12

Now we can put it all together and compute the class to which the test document will be assigned
using the formula (1):

Wintersemester 12/13 A2-1 Kisselew, Kessler, Müller & Schütze



P̂(c|d5) ∝ 1/2 · (2/12)² · 3/12 = 1/2 · 12/(12 · 12 · 12) = 1/2 · 12/1728 = 1/288
P̂(c̄|d5) ∝ 1/2 · (3/12)² · (1/12) = 1/2 · 9/(12 · 12 · 12) = 1/2 · 9/1728 = 1/384

Thus, the classifier assigns the test document to c = Japan.
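The same numbers fall out of a compact multinomial Naive Bayes sketch with add-one smoothing:

```python
from collections import Counter

train = [
    ("Kyoto Tokyo Taiwan".split(), "japan"),
    ("Japan Kyoto".split(), "japan"),
    ("Taipei Taiwan".split(), "other"),
    ("Macao Taiwan Beijing".split(), "other"),
]
test_doc = "Taiwan Taiwan Kyoto".split()

vocab = {t for doc, _ in train for t in doc}   # 7 distinct terms

def class_score(label):
    docs = [doc for doc, y in train if y == label]
    prior = len(docs) / len(train)
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())               # 5 tokens per class here
    score = prior
    for t in test_doc:
        score *= (counts[t] + 1) / (total + len(vocab))  # add-one smoothing
    return score

score_japan = class_score("japan")
score_other = class_score("other")
```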

Exercise 2 (IIR 3) [1 P.]


If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would you do the lookup on?

Solution
We would perform the lookup on the key: ng$s*.

Exercise 3 (IIR 3) [3 P.]


Compute the Levenshtein matrix for the distance between the strings “obama” (input) and “romney”
(output). Use this format (as introduced in class):
        ""   c   a   t   c   a   t

   ""    0   1   2   3   4   5   6
   c     1   0   1   2   3   4   5
   a     2   1   0   1   2   3   4
   t     3   2   1   0   1   2   3

(cell minima of the class format; the example computes the distance between "cat" and "catcat", which is 3)
After you have calculated the distance between the two strings: Trace the editing operations for one
possible editing path as demonstrated in class:
cost operation input output
1 insert * c
1 insert * a
1 insert * t
0 (copy) c c
0 (copy) a a
0 (copy) t t

Solution
Levenshtein matrix:




        ""   r   o   m   n   e   y

   ""    0   1   2   3   4   5   6
   o     1   1   1   2   3   4   5
   b     2   2   2   2   3   4   5
   a     3   3   3   3   3   4   5
   m     4   4   4   3   4   4   5
   a     5   5   5   4   4   5   5

(cell minima shown; the distance between "obama" and "romney" is 5)

Possible editing path:

cost operation input output


1 insert * r
0 (copy) o o
1 replace b m
1 replace a n
1 replace m e
1 replace a y
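The matrix above is the output of the standard dynamic program; a plain Python version that returns just the distance:

```python
def levenshtein(a, b):
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # copy / replace
    return dp[len(a)][len(b)]
```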

Exercise 4 (IIR 13) [2 P.]


Rank the documents in collection {d1 , d2 } for query q using the language model approach to IR
introduced in class with Jelinek-Mercer smoothing. Use the mixture coefficient λ = 0.4.

• d1 : The European Union Act 2011 prevents additional powers being passed to Brussels without
a referendum

• d2 : EU will not ban Chanel 5 perfumes over allergy findings

• Query q: European Union

Solution

Y
P (q|d) = [λP (tk |Md ) + (1 − λ)P (tk |Mc )]
1≤k≤|q|

P (q|d1 ) = [0.4 · 1/15 + 0.6 · 1/25] · [0.4 · 1/15 + 0.6 · 1/25] ≈ 0.051² ≈ 0.0026
P (q|d2 ) = [0.4 · 0/10 + 0.6 · 1/25] · [0.4 · 0/10 + 0.6 · 1/25] = 0.024² = 0.000576

⇒ Ranking: d1 > d2
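The Jelinek-Mercer scores can be recomputed directly from the token counts (|d1| = 15, |d2| = 10, collection size 25):

```python
def jm_score(query, doc_tokens, collection_tokens, lam=0.4):
    # P(q|d) = prod over query terms of [lam*P(t|Md) + (1-lam)*P(t|Mc)]
    score = 1.0
    for t in query:
        p_doc = doc_tokens.count(t) / len(doc_tokens)
        p_col = collection_tokens.count(t) / len(collection_tokens)
        score *= lam * p_doc + (1 - lam) * p_col
    return score

d1 = ("The European Union Act 2011 prevents additional powers "
      "being passed to Brussels without a referendum").split()
d2 = "EU will not ban Chanel 5 perfumes over allergy findings".split()
collection = d1 + d2
q = ["European", "Union"]

s1 = jm_score(q, d1, collection)
s2 = jm_score(q, d2, collection)
```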

Exercise 5 (IIR 12/13) [3 P.]




Consider the following frequencies for the class coffee for four terms in the first 100,000 documents of
Reuters-RCV1:

term N00 N01 N10 N11


brazil 98,012 102 1835 51
council 96,322 133 3525 20
producers 98,524 119 1118 34
roasted 99,824 143 23 10

a) Which two terms will be selected in frequency-based feature selection and why?

b) Compute the MI values and order the terms according to MI. Which two terms will be selected
in MI-based feature selection?

Solution

a) The terms brazil and producers will be selected in frequency-based feature selection because
they are the most frequent terms in the class coffee.

b) brazil:

I(U; C) = 98012/100000 · log2 [(100000 · 98012) / ((102+98012)(1835+98012))]
        + 102/100000 · log2 [(100000 · 102) / ((102+98012)(51+102))]
        + 1835/100000 · log2 [(100000 · 1835) / ((51+1835)(1835+98012))]
        + 51/100000 · log2 [(100000 · 51) / ((51+1835)(51+102))]
        ≈ 0.0015536892

council:

I(U; C) = 96322/100000 · log2 [(100000 · 96322) / ((133+96322)(3525+96322))]
        + 133/100000 · log2 [(100000 · 133) / ((133+96322)(20+133))]
        + 3525/100000 · log2 [(100000 · 3525) / ((20+3525)(3525+96322))]
        + 20/100000 · log2 [(100000 · 20) / ((20+3525)(20+133))]
        ≈ 0.0001774273

producers:

I(U; C) = 98524/99795 · log2 [(99795 · 98524) / ((119+98524)(1118+98524))]
        + 119/99795 · log2 [(99795 · 119) / ((119+98524)(34+119))]
        + 1118/99795 · log2 [(99795 · 1118) / ((34+1118)(1118+98524))]
        + 34/99795 · log2 [(99795 · 34) / ((34+1118)(34+119))]
        ≈ 0.0010479995

roasted:

I(U; C) = 99824/100000 · log2 [(100000 · 99824) / ((143+99824)(23+99824))]
        + 143/100000 · log2 [(100000 · 143) / ((143+99824)(10+143))]
        + 23/100000 · log2 [(100000 · 23) / ((10+23)(23+99824))]
        + 10/100000 · log2 [(100000 · 10) / ((10+23)(10+143))]
        ≈ 0.0006484759

Terms ranked by MI:

(1) brazil
(2) producers
(3) roasted
(4) council

The terms brazil and producers will be selected in MI-based feature selection.
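The four values above can be checked mechanically. The sketch below implements the standard MI formula for a 2×2 term/class contingency table, as used in the computations above (zero cells would be skipped, although none occur here):

```python
from math import log2

def mutual_information(n00, n01, n10, n11):
    """Expected mutual information I(U;C) from a 2x2 term/class
    contingency table (Nxy = docs with term x-present, class y-member)."""
    n = n00 + n01 + n10 + n11
    mi = 0.0
    for n_tc, n_t, n_c in [
        (n11, n10 + n11, n01 + n11),  # term present, in class
        (n01, n00 + n01, n01 + n11),  # term absent,  in class
        (n10, n10 + n11, n00 + n10),  # term present, not in class
        (n00, n00 + n01, n00 + n10),  # term absent,  not in class
    ]:
        if n_tc > 0:
            mi += n_tc / n * log2(n * n_tc / (n_t * n_c))
    return mi

# Frequencies for the class "coffee" from the table above.
for term, counts in [("brazil",    (98012, 102, 1835, 51)),
                     ("council",   (96322, 133, 3525, 20)),
                     ("producers", (98524, 119, 1118, 34)),
                     ("roasted",   (99824, 143,   23, 10))]:
    print(term, round(mutual_information(*counts), 10))
```

Running this reproduces the four MI values and hence the ranking brazil > producers > roasted > council.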



Information Retrieval and Text Mining, WS 2012/2013 Assignment 1 - Solutions

Assignment 1 - Solutions

Exercise 1 (IIR 1) [1 P.]


Shown below is a portion of a positional index in the format:
term: doc1: ⟨position1, position2, . . . ⟩; doc2: ⟨position1, position2, . . . ⟩; etc.

angels: 2: ⟨36,174,252,651⟩; 4: ⟨12,22,102,432⟩; 7: ⟨17⟩;
fools: 2: ⟨1,17,74,222⟩; 4: ⟨8,78,108,458⟩; 7: ⟨3,13,23,193⟩;
fear: 2: ⟨87,704,722,901⟩; 4: ⟨13,43,113,433⟩; 7: ⟨18,328,528⟩;
in: 2: ⟨3,37,76,444,851⟩; 4: ⟨10,20,110,470,500⟩; 7: ⟨5,15,25,195⟩;
rush: 2: ⟨2,66,194,321,702⟩; 4: ⟨9,69,149,429,569⟩; 7: ⟨4,14,404⟩;
to: 2: ⟨47,86,234,999⟩; 4: ⟨14,24,774,944⟩; 7: ⟨199,319,599,709⟩;
tread: 2: ⟨57,94,333⟩; 4: ⟨15,35,155⟩; 7: ⟨20,320⟩;
where: 2: ⟨67,124,393,1001⟩; 4: ⟨11,41,101,421,431⟩; 7: ⟨16,36,736⟩;

Which document(s) (if any) match each of the following queries, and at which positions? Each
expression within quotes is a phrase query. (i) “fools rush in” (ii) “fools rush in” AND “angels fear
to tread”.

Solution
(i) doc2:1, doc4:8, doc7:3,13 (ii) doc4:8 & 12
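The answer to (i) can be reproduced with a small positional-intersection sketch; dictionaries of docID → sorted position lists stand in for real postings lists (a simplified stand-in, not the merge algorithm from the book):

```python
def phrase_positions(index, terms):
    """Return {docID: [positions]} where the phrase occurs; a reported
    position is the position of the first term of the phrase."""
    result = {}
    # Only documents containing every term can match the phrase.
    common_docs = set.intersection(*(set(index[t]) for t in terms))
    for doc in sorted(common_docs):
        hits = [p for p in index[terms[0]][doc]
                if all(p + i in index[t][doc]
                       for i, t in enumerate(terms))]
        if hits:
            result[doc] = hits
    return result

index = {
    "fools": {2: [1, 17, 74, 222], 4: [8, 78, 108, 458], 7: [3, 13, 23, 193]},
    "rush":  {2: [2, 66, 194, 321, 702], 4: [9, 69, 149, 429, 569], 7: [4, 14, 404]},
    "in":    {2: [3, 37, 76, 444, 851], 4: [10, 20, 110, 470, 500], 7: [5, 15, 25, 195]},
}
print(phrase_positions(index, ["fools", "rush", "in"]))
# {2: [1], 4: [8], 7: [3, 13]}
```

For (ii), intersecting the matching docIDs of both phrases leaves only document 4.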

Exercise 2 (IIR 1) [3 P.]


Write a (Python) program [...]

Solution
See boolean.py in the assignment 1 ex 2 solution.zip file on the course homepage.

Exercise 3 (IIR 2) [1 P.]


The following pairs of words are stemmed to the same form by the German version of the Porter
stemmer [1]. Which pairs, would you argue, should not be conflated? Give your reasoning.

Solution

(a) geräumige/geräumigen (→ geraum)
    OK
(b) musik/musiker (→ musik)
    Sometimes good, sometimes bad, e.g. when somebody wants to find out more about a certain
    musical genre.
(c) neuer/neues (→ neu)
    Not OK, e.g. when somebody wants to get some information about the goalkeeper of the German
    National Soccer Team.
(d) persönlich/persönlichkeit (→ person)
    The two words are thematically far away from each other.

[1] http://snowball.tartarus.org/algorithms/german/stemmer.html


(e) schlaf/schlafen (→ schlaf)
    OK.
(f) unternehmer/unternehmung (→ unternehm)
    Semantically too far away.
(g) wetten/wetter (→ wett)
    Completely different words.

Exercise 4 (IIR 1) [1 P.]


For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain
why it is, or give an example where it is not.

Solution
Processing postings lists in order of size (i.e. the shortest postings list first) is usually a good approach,
but it is not guaranteed to be optimal. Consider, e.g., a conjunctive query with three terms:

term 1 −→ 1 2 3

term 2 −→ 2 3 4 5

term 3 −→ 10 11 20 30 50

As we can see, there is no document containing all three query terms. Had we checked the first posting
of the third list right at the beginning, we would have noticed that there is no intersection between
the first and the third postings lists, which would have made any further search superfluous.



Information Retrieval and Text Mining, WS 2012/2013 Assignment 3 - Solutions

Assignment 3 - Solutions

Exercise 1 (IIR 5) [3 P.]


Compute variable byte and γ-codes for the postings list 776, 801, 1101, 312513. Use gaps instead of
docIDs for all but the first entry.
Give the solution for variable bytes as a sequence of 8-bit blocks (as presented in class e.g. on slide 42 of
IIR 5). Give the solution for the γ-codes of the postings list as a sequence of 4 pairs of bit strings, where
the first bit string of each pair corresponds to a length and the second to an offset (see slide 48 of IIR 5).

Solution

1. Variable byte encoding, bytes in decimal: 6 136, 153, 2 172, 19 0 244;
   bytes in binary: 00000110 10001000 10011001 00000010 10101100 00010011 00000000 11110100

2. Gamma encoding: 1111111110 100001000, 11110 1001, 111111110 00101100,
   1111111111111111110 001100000001110100
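The variable-byte bytes can be reproduced with a few lines of Python. This is a sketch of the scheme used above: 7 payload bits per byte, with the high (continuation) bit set only on the last byte of each gap:

```python
def vb_encode_number(n):
    """Variable-byte encode one gap: 7 payload bits per byte,
    high bit set on the final byte only."""
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)  # prepend the low 7 bits
        if n < 128:
            break
        n //= 128
    bytes_[-1] += 128  # set the continuation bit on the last byte
    return bytes_

def vb_encode(postings):
    # Convert docIDs to gaps (first entry kept as-is), then encode.
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return [byte for gap in gaps for byte in vb_encode_number(gap)]

print(vb_encode([776, 801, 1101, 312513]))
# [6, 136, 153, 2, 172, 19, 0, 244]  (gaps 776, 25, 300, 311412)
```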

Exercise 2 (IIR 5) [2 P.]


Consider the following sequence of γ-coded gaps: 011110001110111111010111110101110111.

(i) What is the sequence of gaps?

(ii) What is the sequence of postings? (the first entry is the docID of the first document)

Solution

(i) Gap sequence: 1 19 3 55 6 15

(ii) DocID sequence: 1 20 23 78 84 99
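A small decoder makes both sequences easy to check; this sketch assumes the input bit string is a well-formed concatenation of gamma codes:

```python
def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of gaps."""
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":   # unary length prefix
            length += 1
            i += 1
        i += 1                  # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        # Re-attach the implicit leading 1 of the offset.
        gaps.append(int("1" + offset, 2))
    return gaps

bits = "011110001110111111010111110101110111"
gaps = gamma_decode(bits)
docids = [sum(gaps[:k + 1]) for k in range(len(gaps))]
print(gaps)    # [1, 19, 3, 55, 6, 15]
print(docids)  # [1, 20, 23, 78, 84, 99]
```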

Exercise 3 (IIR 6) [4 P.]


Compute the ltc.lnn similarity between the query “digital phones” and the document “digital phones
and video phones and other phones” by filling out the empty columns in the table below. Assume
N = 10,000,000. Treat and and other as stop words. What is the final similarity score between the
query and the document? What is the corresponding Jaccard coefficient?

Solution
                 document                                   query
word     tf-raw  tf-wght  df       idf  weight  n'lized    tf-raw  tf-wght  weight    product
digital  1       1        10,000   3    3       0.61       1       1        1         0.61
video    1       1        100,000  2    2       0.40       0       0        0         0
phones   3       1.48     50,000   2.3  3.4     0.69       1       1        1         0.69

Length of the document vector: √(3² + 2² + 3.4²) ≈ 4.96
Normalized document term weights: 3/4.96 ≈ 0.61, 2/4.96 ≈ 0.40, 3.4/4.96 ≈ 0.69


• Similarity score: 0.61 + 0.69 = 1.30

• Jaccard(Q, D) = |Q ∩ D| / |Q ∪ D| = |{digital, phones}| / |{digital, video, phones}| = 2/3
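The ltc weights can be recomputed as follows. Note that working with unrounded weights gives a score of about 1.29; the 1.30 above comes from summing the already-rounded products 0.61 and 0.69:

```python
from math import log10, sqrt

N = 10_000_000
doc_tf = {"digital": 1, "video": 1, "phones": 3}
df     = {"digital": 10_000, "video": 100_000, "phones": 50_000}
query_tf = {"digital": 1, "phones": 1}

# Document side, ltc: l (log tf), t (idf), c (cosine normalization).
w = {t: (1 + log10(tf)) * log10(N / df[t]) for t, tf in doc_tf.items()}
norm = sqrt(sum(x * x for x in w.values()))
doc = {t: x / norm for t, x in w.items()}

# Query side, lnn: l (log tf), n (no idf), n (no normalization).
score = sum(doc[t] * (1 + log10(tf)) for t, tf in query_tf.items())
print({t: round(x, 2) for t, x in doc.items()}, round(score, 2))
```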

Exercise 4 (IIR 9) [Optional: 2 P.]


Suppose that a user’s initial query is “cheap CDs cheap DVDs extremely cheap CDs”. The user
examines two documents, d1 and d2. She judges d1, with the content “CDs software cheap CDs”
relevant and d2 with content “cheap thrills DVDs” nonrelevant. Assume that we are using direct
term frequency (with no scaling and no document frequency). Do not length-normalize vectors for
this exercise. Using Rocchio relevance feedback as in Equation 9.3 (book: page 182), what would the
revised query vector be after relevance feedback? Assume α = 1, β = 0.8, γ = 0.2. Keep in mind that
negative term weights are treated in a special way.

Solution

Equation 9.3:

qm = α·q0 + β · (1/|Dr|) · Σ_{dj ∈ Dr} dj − γ · (1/|Dnr|) · Σ_{dj ∈ Dnr} dj

Query vector after relevance feedback:


word q d1 d2 αq βd1 γd2 rocchio
CDs 2 2 0 2 1.6 0 3.6
cheap 3 1 1 3 0.8 0.2 3.6
DVDs 1 0 1 1 0 0.2 0.8
extremely 1 0 0 1 0 0 1.0
software 0 1 0 0 0.8 0 0.8
thrills 0 0 1 0 0 0.2 0.0
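The table can be reproduced directly from Equation 9.3 with one relevant and one nonrelevant document, clipping negative weights to 0 (the special treatment mentioned above):

```python
VOCAB = ["CDs", "cheap", "DVDs", "extremely", "software", "thrills"]
q  = [2, 3, 1, 1, 0, 0]   # raw term frequencies of the query
d1 = [2, 1, 0, 0, 1, 0]   # judged relevant
d2 = [0, 1, 1, 0, 0, 1]   # judged nonrelevant

alpha, beta, gamma = 1.0, 0.8, 0.2

# Rocchio update; negative component weights are clipped to 0.
qm = [max(0.0, alpha * qi + beta * ri - gamma * ni)
      for qi, ri, ni in zip(q, d1, d2)]
print(dict(zip(VOCAB, qm)))
```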



Information Retrieval and Text Mining, WS 2012/2013 Assignment 4

Assignment 4

Exercise 1 (IIR 8) [3 P.]


Sentiment analysis “aims to determine the attitude of a speaker or a writer with respect to some topic
or the overall contextual polarity of a document” [1]. Assume the IMS develops a novel approach for
sentiment analysis. To test the success of the new approach, a gold standard is needed. Thus, two
annotators independently annotate 200 documents regarding whether they convey a positive attitude
or not. The following table shows how often they agreed. Calculate the Kappa coefficient for the
agreement between the two annotators. Will it be possible to construct a gold standard from this
annotated data?

                            Judge 2: Document positive?
                            Yes    No    Total
Judge 1:             Yes    120    30    150
Document positive?   No      40    10     50
                     Total  160    40    200

Solution
We calculate the Kappa measure by means of the following formula:

κ = (P(A) − P(E)) / (1 − P(E))

where

• P (A) = proportion of time judges agree

• P (E) = what agreement would we get by chance

Observed proportion of the times the judges agreed P (A) = (120 + 10)/200 = 130/200 = 0.65

Pooled marginals
P (positive) = (160 + 150)/(200 + 200) = 310/400 = 0.775
P (not positive) = (40 + 50)/(200 + 200) = 90/400 = 0.225

Probability that the two judges agreed by chance: P(E) = P(not positive)² + P(positive)² = 0.225² +
0.775² = 0.050625 + 0.600625 = 0.65125
Kappa statistic: κ = (P(A) − P(E))/(1 − P(E)) = (0.65 − 0.65125)/(1 − 0.65125) = −0.00125/0.34875
≈ −0.00358422
The agreement is too low to be a reliable basis for a gold standard.
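A small helper reproduces the computation; the table layout follows the exercise (judge 1 in rows, judge 2 in columns), with pooled marginals as above:

```python
def cohen_kappa(yes_yes, yes_no, no_yes, no_no):
    """Kappa from a 2x2 agreement table, using pooled marginals."""
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    # Pooled marginal probability of a "positive" judgment.
    p_pos = (yes_yes + yes_no + yes_yes + no_yes) / (2 * n)
    p_chance = p_pos ** 2 + (1 - p_pos) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(cohen_kappa(120, 30, 40, 10))
```

The result is slightly negative, i.e. the annotators agree marginally less often than chance would predict.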

Exercise 2 (IIR 14) [8 P.]


Develop a simple kNN classifier which asks the user for a k and then assigns a class to new documents.
[...]
Solution
See knn.py in the assignment 4 ex 2 solution.zip file on the course homepage.
[1] http://en.wikipedia.org/wiki/Sentiment_analysis



Information Retrieval and Text Mining, WS 2012/2013 Assignment 5 - Solutions

Assignment 5 - Solutions

Exercise 1 (IIR 15) [4 P.]


As you know from class, a Support Vector Machine (SVM) is estimated by finding the smallest vector
w such that

yi (w^T xi + b) ≥ 1.                                        (1)

Estimate an SVM for the data given below. Find the support vectors and the general form of the
normal vector before solving equation system (1) for the best matching w. Chapter 15.1 of the IR
book may help you with this exercise.

[Figure: training data in the plane (x from 0 to 6, y from 0 to 5); the triangles in the upper region
form one class, the circles in the lower region the other.]
Solution
(1) Correct solution:

[Figure: the training points with the maximum-margin hyperplane; the support vectors are
x1 = (3, 2) from the circles' class and x2 = (4.5, 3), x3 = (3, 4.5) from the triangles' class.]
The weight vector is parallel to the shortest line connecting points of the two classes, that is, the line
between x1 = (3, 2) and x2 = (4.5, 3), giving a weight vector of x2 − x1 = (1.5, 1). But if we take the
two points x1 and x2 as support vectors for the classifier margin, we see that the point x3 = (3, 4.5)
is inside the margin (see the second-best solution below). Thus, the decision hyperplane constructed
only from the two points x1 and x2 would not guarantee the largest margin possible. Therefore we
have to consider the point x3 as an additional support vector.
Before we can arrange an equation system to find the best matching w, we have to calculate the weight
vector w, which is perpendicular to the decision hyperplane. Since there are two support vectors from
the triangles' class (x2 and x3) on the top margin line, we first compute the vector that connects them:
x2 − x3 = (1.5, −1.5)^T. Next, we compute a possible perpendicular vector by means of the dot product
(two vectors are perpendicular if their dot product is 0): 1.5 · 1 + (−1.5) · 1 = 0. Thus, w = (1, 1)^T.
So we already know that the solution is w = (1a, 1a) for some a. Using this knowledge we can arrange
a system of equations which considers all three support vectors:

w · x1 + b = 5a + b = −1
w · x2 + b = 7.5a + b = 1
w · x3 + b = 7.5a + b = 1
By solving this system of equations we get the following values for a and b:

a = 4/5
b = −5

So the optimal hyperplane is given by w = (1 · 4/5, 1 · 4/5) = (4/5, 4/5) and b = −5.

Optional: The margin ρ is 2/|w| = 2/√(0.8² + 0.8²) = 2/√1.28 ≈ 2/1.131 ≈ 1.77.
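A quick numerical check that w = (4/5, 4/5), b = −5 puts all three support vectors exactly on the margin, i.e. yi(w·xi + b) = 1 for each:

```python
w, b = (0.8, 0.8), -5.0

# (point, class label) for the three support vectors.
support = [((3.0, 2.0), -1), ((4.5, 3.0), 1), ((3.0, 4.5), 1)]
margins = [y * (w[0] * x[0] + w[1] * x[1] + b) for x, y in support]
print([round(m, 10) for m in margins])  # [1.0, 1.0, 1.0]

# Geometric margin 2/|w|.
rho = 2 / (w[0] ** 2 + w[1] ** 2) ** 0.5
print(round(rho, 2))  # 1.77
```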

(2) Second-best solution:

[Figure: the training points with the hyperplane constructed from x1 and x2 only.]

The weight vector is parallel to the shortest line connecting points of the two classes, that is, the line
between x1 = (3, 2) and x2 = (4.5, 3), giving a weight vector of x2 − x1 = (1.5, 1). In this (not
completely correct) solution we ignore that there is another point of the triangles' class within the
functional margin.
Thus, we know that w = (1.5a, 1a) for some a. So we get the following system of equations:

w · x1 + b = 6.5a + b = −1
w · x2 + b = 9.75a + b = 1

(Note: w · x1 = 1.5a · 3 + 1a · 2 = 6.5a.) By solving the system of equations we get the following
values for a and b:

a = 8/13
b = −5

So the hyperplane of this solution is given by w = (1.5 · 8/13, 1 · 8/13) = (12/13, 8/13) ≈ (0.923, 0.615)
and b = −5.

Optional: The margin ρ is 2/|w| = 2/√(0.923² + 0.615²) ≈ 2/√1.23 ≈ 2/1.109 ≈ 1.803.


Exercise 2 (IIR 16) [4 P.]


(i) Perform a 2-means clustering to convergence for the points below. Start with the two seeds a and b.
For each iteration give (I) the coordinates of the centroids (II) the assignments of points to centroids.
(ii) Give the coordinates of a fifth point e and two centroids with the following properties:

1. the two centroids are a local optimum; that is, one iteration of reassignment and recomputation
will not change the position of the centroids

2. the two centroids are not the global optimum.

(iii) Give two centroids that are better for the 5 points than the ones in (ii). (No need to prove global
optimality, but show they are better than in (ii).)

Solution

[Figure: the four points a = (1, 1), b = (1, 2), c = (4, 5), d = (6, 7) (left), and the same points
together with the fifth point e = (6, 1.5) from part (ii) (right); coordinates as used in the calculations
below.]
(i)
1. centroids A = (1,1), B = (1,2)
1. assignment: A: a; B: b, c, d
2. centroids A = (1,1) B = (11/3,14/3) = (3.67,4.67)
2. assignment: A: a,b; B: c,d
3. centroids: A: (1,1.5) B: (5,6)
3. assignment A: a,b; B: c,d
4. centroids: A: (1,1.5) B: (5,6)
=⇒ converged

(ii)
e = (6, 1.5), A = (3, 2.375), D = (6, 7)
|A − c|² = (3 − 4)² + (2.375 − 5)² ≈ 7.89
|D − c|² = (6 − 4)² + (7 − 5)² = 8
So c is closer to A than to D. a, b and e are also closer to A than to D. So this is a stable set of two
centroids for the five points.
RSS = sum of squared distances of a, b, c, d, e (where a, b, c, e are assigned to A and d is assigned to
D): ((3−1)² + (2.375−1)²) + ((3−1)² + (2.375−2)²) + 7.89 + (0² + 0²) + ((3−6)² + (2.375−1.5)²) ≈ 27.69


(iii)
A = (8/3, 1.5), D = (5, 6) is a better set of centroids than the ones in (ii). Proof:
RSS = sum of squared distances of a, b, c, d, e (where a, b, e are assigned to A and c, d are assigned
to D): ((8/3 − 1)² + 0.5²) + ((8/3 − 1)² + 0.5²) + (1² + 1²) + (1² + 1²) + ((8/3 − 6)² + 0²) ≈ 21.17 < 27.69
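Both RSS values can be checked with a short script. The point coordinates below are the ones used in the calculations above (a = (1,1), b = (1,2), c = (4,5), d = (6,7), e = (6,1.5)):

```python
def rss(points, centroids):
    """Residual sum of squares: each point contributes its squared
    Euclidean distance to the nearest centroid."""
    return sum(min((px - cx) ** 2 + (py - cy) ** 2
                   for cx, cy in centroids)
               for px, py in points)

points = [(1, 1), (1, 2), (4, 5), (6, 7), (6, 1.5)]

local  = [(3, 2.375), (6, 7)]    # centroids from (ii), a local optimum
better = [(8 / 3, 1.5), (5, 6)]  # centroids from (iii)
print(rss(points, local), rss(points, better))
```

The first value is ≈ 27.69 and the second ≈ 21.17, confirming that the centroids from (iii) are better.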

Exercise 3 (IIR 16) [1 P.]


Why are documents that do not use the same term for the concept car likely to end up in the same
cluster in K-means clustering?

Solution
If two documents are thematically similar, they contain similar terms. Even if a frequent term does
not occur in a document, the document will still be clustered correctly thanks to the other terms it
shares with the topic.

Exercise 4 (IIR 16) [1 P.]


Two of the possible termination conditions for K-means were (1) assignment does not change, (2) cen-
troids do not change (IR book, page 361). Do these two conditions imply each other? Why or why not?

Solution
The two conditions imply each other. If the assignment does not change, then the centroids remain
the same. And if the centroids do not change, then no reassignment took place.



Information Retrieval and Text Mining, WS 2012/2013 Assignment 6

Assignment 6

Exercise 1 (IIR 18) [3 P.]


Given is the singular value decomposition C = UΣV^T of the term-document matrix C from the lecture:

C      d1  d2  d3  d4  d5  d6
ship    1   0   1   0   0   0
boat    0   1   0   0   0   0
ocean   1   1   0   0   0   0
wood    1   0   0   1   1   0
tree    0   0   0   1   0   1

Σ        1     2     3     4     5
1     2.16  0.00  0.00  0.00  0.00
2     0.00  1.59  0.00  0.00  0.00
3     0.00  0.00  1.28  0.00  0.00
4     0.00  0.00  0.00  1.00  0.00
5     0.00  0.00  0.00  0.00  0.39

U        1      2      3      4      5
1     0.44  -0.30  -0.57   0.58  -0.25
2     0.13  -0.33   0.59   0.00  -0.73
3     0.48  -0.51   0.37   0.00   0.61
4     0.70   0.35  -0.15  -0.58  -0.16
5     0.26   0.65   0.41   0.58   0.09

V        1      2      3      4      5
1     0.75  -0.29  -0.28   0.00   0.53
2     0.28  -0.53   0.75   0.00  -0.29
3     0.20  -0.19  -0.45   0.58  -0.63
4     0.45   0.63   0.20   0.00  -0.19
5     0.33   0.22  -0.12   0.58  -0.41
6     0.12   0.41   0.33   0.58   0.22

1. Calculate the reduced matrix C3, that is, the term-document matrix C reduced to 3 dimensions
   (see slides).

2. Compare the rankings of the query “ship ocean” for the matrices C and C3: rank the documents
   by relevance.

Solution
a)
To calculate C3 we need the matrix Σ with only the first three singular values (the other values are
zeroed out):

Σ3 1 2 3 4 5
1 2.16 0.00 0.00 0.00 0.00
2 0.00 1.59 0.00 0.00 0.00
3 0.00 0.00 1.28 0.00 0.00
4 0.00 0.00 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00

Furthermore V has to be transposed, since the formula for the singular value decomposition is C = UΣV^T:

VT 1 2 3 4 5 6
1 0.75 0.28 0.20 0.45 0.33 0.12
2 -0.29 -0.53 -0.19 0.63 0.22 0.41
3 -0.28 0.75 -0.45 0.20 -0.12 0.33
4 0.00 0.00 0.58 0.00 0.58 0.58
5 0.53 -0.29 -0.63 -0.19 -0.41 0.22

Now we can multiply U with Σ3 and get the following matrix A:


A 1 2 3 4 5
1 0.950 -0.477 -0.730 0.000 0.000
2 0.281 -0.525 0.755 0.000 0.000
3 1.037 -0.811 0.474 0.000 0.000
4 1.512 0.557 -0.192 0.000 0.000
5 0.562 1.034 0.525 0.000 0.000

After multiplying A by matrix V T we finally get the matrix C3 :


C3 d1 d2 d3 d4 d5 d6
ship 1.055 -0.029 0.609 -0.019 0.296 -0.322
boat 0.152 0.923 -0.184 -0.053 -0.113 0.068
ocean 0.880 1.076 0.148 0.051 0.107 -0.052
wood 1.026 -0.016 0.283 0.993 0.645 0.346
tree -0.025 0.003 -0.320 1.009 0.350 0.665

Remark:
For comparison, the incorrect matrix C3 that results if A is multiplied by V instead of V^T:
C3 d1 d2 d3 d4 d5
ship 0.433 0.116 -0.295 -0.423 1.102
boat 0.215 0.053 -0.812 0.438 -0.174
ocean 0.645 0.039 -1.112 0.275 0.486
wood 1.252 -0.697 0.081 -0.111 0.761
tree 0.816 -0.811 0.382 0.305 -0.333

b)
C: d1, d2, d3 or d1, d3, d2 (d2 and d3 are tied: each contains exactly one of the two query terms)

C3: d1, d2, d3, d5, d4, d6
Computation:

1. d1 : 1.055 + 0.88 = 1.935

2. d2 : (-0.029) + 1.076 = 1.047

3. d3 : 0.609 + 0.148 = 0.757

4. d5 : 0.296 + 0.107 = 0.403

5. d4 : (-0.019) + 0.051 = 0.032

6. d6 : (-0.322) + (-0.052) = -0.374
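The matrix C3 and the ranking can be reproduced with plain Python (no linear-algebra library needed). The results differ from the tables above only in rounding, since U, Σ and V are given to two decimals:

```python
U = [[0.44, -0.30, -0.57,  0.58, -0.25],
     [0.13, -0.33,  0.59,  0.00, -0.73],
     [0.48, -0.51,  0.37,  0.00,  0.61],
     [0.70,  0.35, -0.15, -0.58, -0.16],
     [0.26,  0.65,  0.41,  0.58,  0.09]]
V = [[0.75, -0.29, -0.28, 0.00,  0.53],
     [0.28, -0.53,  0.75, 0.00, -0.29],
     [0.20, -0.19, -0.45, 0.58, -0.63],
     [0.45,  0.63,  0.20, 0.00, -0.19],
     [0.33,  0.22, -0.12, 0.58, -0.41],
     [0.12,  0.41,  0.33, 0.58,  0.22]]
sigma = [2.16, 1.59, 1.28, 1.00, 0.39]
k = 3  # keep only the three largest singular values

# C3[t][d] = sum over the first k dimensions of U[t][i] * sigma[i] * V[d][i]
C3 = [[sum(U[t][i] * sigma[i] * V[d][i] for i in range(k))
       for d in range(6)] for t in range(5)]

# Score each document for the query "ship ocean" (term rows 0 and 2).
scores = [round(C3[0][d] + C3[2][d], 3) for d in range(6)]
print(scores)  # ≈ [1.936, 1.047, 0.757, 0.032, 0.403, -0.374]
```

The scores reproduce the ranking d1 > d2 > d3 > d5 > d4 > d6.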


Exercise 2 (IIR 19) [3 P.]


The shingle representations of three documents are as follows: d3 = (0, 0, 1, 0, 0, 0, 1)T , d4 = (0, 0, 1, 0, 0, 0, 0)T ,
d5 = (1, 1, 1, 0, 1, 1, 1)T
We will use sketches of size 2. The two elements of a sketch are defined by the permutations (2 × n + 2)
mod 7 and (4 × n + 1) mod 7. Based on this setup, what are the estimates of the three Jaccard
coefficients J(d3 , d4 ), J(d3 , d5 ), and J(d4 , d5 )? Use the kind of table introduced in class to visualize the
permutations and to calculate the final sketches.

Solution
Hash functions for the permutations: h(x) = (2x + 2) mod 7, g(x) = (4x + 1) mod 7.

Running minima (min-h, min-g) per document, starting at (∞, ∞); a row updates a document's slots
only if the document contains that shingle:

shingle  d3  d4  d5   h(x)  g(x)   d3        d4        d5
s1        0   0   1    4     5     (∞, ∞)    (∞, ∞)    (4, 5)
s2        0   0   1    6     2     (∞, ∞)    (∞, ∞)    (4, 2)
s3        1   1   1    1     6     (1, 6)    (1, 6)    (1, 2)
s4        0   0   0    3     3     (1, 6)    (1, 6)    (1, 2)
s5        0   0   1    5     0     (1, 6)    (1, 6)    (1, 0)
s6        0   0   1    0     4     (1, 6)    (1, 6)    (0, 0)
s7        1   0   1    2     1     (1, 1)    (1, 6)    (0, 0)

Final sketches: d3 = (1, 1), d4 = (1, 6), d5 = (0, 0)

J(d3 , d4 ) = (1 + 0)/2 = 1/2
J(d3 , d5 ) = (0 + 0)/2 = 0
J(d4 , d5 ) = (0 + 0)/2 = 0
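The sketch computation can be automated; each permutation is simulated by taking the minimum hash value over the shingles present in a document:

```python
def sketch(shingles, hash_fns):
    """Min-hash sketch: for each permutation (hash function), take the
    minimum hash value over the shingles present in the document."""
    return tuple(min(fn(s) for s in shingles) for fn in hash_fns)

h = lambda x: (2 * x + 2) % 7
g = lambda x: (4 * x + 1) % 7

# Positions of the 1-bits in the shingle vectors.
d3, d4, d5 = {3, 7}, {3}, {1, 2, 3, 5, 6, 7}
s3, s4, s5 = (sketch(d, (h, g)) for d in (d3, d4, d5))
print(s3, s4, s5)  # (1, 1) (1, 6) (0, 0)

def jaccard_estimate(a, b):
    # Fraction of sketch components on which the two documents agree.
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(jaccard_estimate(s3, s4), jaccard_estimate(s3, s5))  # 0.5 0.0
```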



Information Retrieval and Text Mining, WS 2012/2013 Assignment 7 - Solutions

Assignment 7 - Solutions

Exercise 1 (IIR 21) [1 P.]


What is ergodicity and why is it important for PageRank?

Solution
ergodic = aperiodic (no periodic behavior) and irreducible (roughly: there is a path from every page
to every other page).
PageRank is well defined only if the random walk on the web graph is ergodic: ergodicity guarantees
a unique long-term visit rate for each page, independent of the start state.

Exercise 2 (IIR 21) [3 P.]

[Figure: web graph with pages q1, q2, q3; q2 links to q1 and q3 links to q1, and q1 has no outlinks.]

For the web graph in the figure, compute PageRank, hub and authority scores for each of the three
pages. Also give the relative ordering of the 3 nodes indicating any ties.
Assume that at each step of the PageRank random walk, we teleport to a random page with probability
0.1, with a uniform distribution over which particular page we teleport to. Normalize the hub and
authority scores so that the maximum hub/authority score is 1.
Hint: Using symmetries to simplify and solving with linear equations might be easier than using
iterative methods.

Solution
By symmetry, rank(q2) = rank(q3). Page q1 receives the link mass of both other pages, so it will end
up with the highest rank.

PageRank, (1) solution using the power method

Transition matrix P′ without teleport:
      q1  q2  q3
q1     0   0   0
q2     1   0   0
q3     1   0   0

Transition matrix P with teleport (the dangling page q1 teleports with probability 1):
      q1     q2    q3
q1    1/3    1/3   1/3
q2    14/15  1/30  1/30
q3    14/15  1/30  1/30

Power iteration with initialization x = (1/3, 1/3, 1/3):

x·P^1    0.733333  0.133333  0.133333
x·P^2    0.493333  0.253333  0.253333
x·P^3    0.637333  0.181333  0.181333
x·P^4    0.550933  0.224533  0.224533
x·P^5    0.602773  0.198613  0.198613
x·P^6    0.571669  0.214165  0.214165
x·P^7    0.590332  0.204834  0.204834
x·P^8    0.579134  0.210433  0.210433
x·P^9    0.585853  0.207074  0.207074
x·P^10   0.581822  0.209089  0.209089
x·P^11   0.584240  0.207880  0.207880
x·P^12   0.582789  0.208605  0.208605
x·P^13   0.583660  0.208170  0.208170
x·P^14   0.583137  0.208431  0.208431
x·P^15   0.583451  0.208275  0.208275
x·P^16   0.583263  0.208369  0.208369
x·P^17   0.583376  0.208312  0.208312
x·P^18   0.583308  0.208346  0.208346
x·P^19   0.583349  0.208326  0.208326
x·P^20   0.583324  0.208338  0.208338
x·P^21   0.583339  0.208331  0.208331
x·P^22   0.583330  0.208335  0.208335
x·P^23   0.583335  0.208332  0.208332
x·P^24   0.583332  0.208334  0.208334
x·P^25   0.583334  0.208333  0.208333
x·P^26   0.583333  0.208334  0.208334
x·P^27   0.583334  0.208333  0.208333
x·P^28   0.583333  0.208333  0.208333
x·P^29   0.583333  0.208333  0.208333
x·P^30   0.583333  0.208333  0.208333

=⇒ Ranking: q1 > q2 = q3

PageRank, (2) solution via linear equations

q2 and q3 have the same PageRank x; let y be the PageRank of q1. We have:

(14/15) · 2x + (1/3) y = y
(28/15) x − (2/3) y = 0
(28/15) x − (2/3)(1 − 2x) = 0        (substituting y = 1 − 2x)
(48/15) x = 2/3

=⇒ x = (2 · 15)/(3 · 48) = 5/24 ≈ 0.208333
   y = 1 − 2x = 14/24 = 7/12 ≈ 0.583333
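The power iteration is also easy to reproduce; this sketch iterates the teleport-augmented matrix P from above until well past convergence:

```python
# Row-stochastic transition matrix with teleport probability 0.1;
# the dangling page q1 teleports uniformly with probability 1.
P = [[1 / 3,   1 / 3,  1 / 3],
     [14 / 15, 1 / 30, 1 / 30],
     [14 / 15, 1 / 30, 1 / 30]]

x = [1 / 3, 1 / 3, 1 / 3]
for _ in range(100):
    # x <- x . P (left multiplication by the row vector x)
    x = [sum(x[i] * P[i][j] for i in range(3)) for j in range(3)]

print([round(v, 6) for v in x])  # [0.583333, 0.208333, 0.208333]
```

The fixed point matches the closed-form solution x = 5/24, y = 7/12.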
24 12

HITS, Solution 1


matrix A      matrix A^T
0 0 0         0 1 1
1 0 0         0 0 0
1 0 0         0 0 0

matrix AA^T   matrix A^T A
0 0 0         2 0 0
0 1 1         0 0 0
0 1 1         0 0 0

a = (1 1 1)^T
(A^T A) a = (2 0 0)^T
(A^T A)² a = (4 0 0)^T
(A^T A)³ a = (8 0 0)^T

h = (1 1 1)^T
(AA^T) h = (0 2 2)^T
(AA^T)² h = (0 4 4)^T
(AA^T)³ h = (0 8 8)^T

After normalization: a = (1 0 0), h = (0 1 1)

Authority ranking: q1 > q2 = q3
Hub ranking: q2 = q3 > q1

HITS, Solution 2
Authorities: authority(q2) = authority(q3) = 0 since nobody points to these two pages. authority(q1) >
0 since somebody points to q1. After normalization (there is no page with a greater authority score)
this value is 1.0.
Hubs: by similar reasoning, hub(q1) = 0 and hub(q2) = hub(q3) > 0.
There is no page with a hub score higher than q2 and q3, thus hub(q2) = hub(q3) = 1.

Exercise 3 (IIR 6) [3 P.]


One measure of the similarity of two vectors is the Euclidean distance between them:

|x − y| = √( Σ_{i=1..M} (x_i − y_i)² )

Given a query q and documents d1 , d2 , . . ., we may rank the documents d_i in order of increasing
Euclidean distance from q. Show (by a mathematical proof) that if q and the d_i are all normalized
to unit vectors, then the rank ordering produced by Euclidean distance is identical to that produced
by cosine similarity.

Solution

Σ_i (q_i − w_i)² = Σ_i q_i² − 2 Σ_i q_i w_i + Σ_i w_i² = 1 − 2 Σ_i q_i w_i + 1 = 2(1 − Σ_i q_i w_i)

(Note that for a normalized vector x we have Σ_i x_i² = 1.)

Thus:
|q − v| < |q − w|
⇔ |q − v|² < |q − w|²
⇔ Σ_i (q_i − v_i)² < Σ_i (q_i − w_i)²
⇔ 2(1 − Σ_i q_i v_i) < 2(1 − Σ_i q_i w_i)
⇔ Σ_i q_i v_i > Σ_i q_i w_i
⇔ cos(q, v) > cos(q, w)

This proves that ordering normalized vectors according to increasing Euclidean distance is the same
as ordering them according to decreasing cosine similarity.


Exercise 4 (IIR 8) [3 P.]


An unranked document retrieval approach is tested on a test set that consists of 300 documents. In
response to a query 200 documents are retrieved of which 170 docs are relevant to the query and 30 not
relevant. From the entire test corpus 190 documents are considered to be relevant for the mentioned
query.

(a) Calculate precision, recall, accuracy and (balanced) f-measure of the presented classifier.
(b) Why do we usually have to face a tradeoff between precision and recall?

Solution

relevant nonrelevant
(a) retrieved 170 30
not retrieved 20 80

• Precision = tp/(tp + fp) = 170/(170 + 30) = 0.85

• Recall = tp/(tp + fn) = 170/(170 + 20) ≈ 0.895

• Accuracy = (tp + tn)/(tp + fp + fn + tn) = (170 + 80)/(170 + 30 + 20 + 80) ≈ 0.83

• F-measure = 2PR/(P + R) ≈ 0.87
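All four measures follow mechanically from the contingency table; a minimal sketch:

```python
def prf(tp, fp, fn, tn):
    """Precision, recall, accuracy and balanced F-measure from the
    counts of a retrieval contingency table."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return {"precision": p,
            "recall": r,
            "accuracy": (tp + tn) / (tp + fp + fn + tn),
            "f1": 2 * p * r / (p + r)}

m = prf(tp=170, fp=30, fn=20, tn=80)
print(m)
```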

(b) Because different users have different needs. Some users want to get documents that match their
    query as exactly as possible (precision). They do not want to read all the documents that are
    available for a certain topic (recall). On the other hand there are people who want to obtain all
    documents related to a topic, e.g. lawyers who want to get all law documents related to drug
    possession. They need high recall.
    With respect to an information retrieval system, we can always achieve a recall of 1 by retrieving
    the whole collection, but then precision will be very low. When we want high precision, we can
    achieve it by only returning the documents we are very sure are relevant, in the extreme only 1
    document. This will of course produce a very low recall. In practice, we will never have these
    extreme behaviours, but we nearly always face a decision whether we want to increase precision
    or recall.

Exercise 5 (IIR 13-16) [5 P.]


As we have seen in chapter 14 there exist several types of classification algorithms.

(a) List the classification algorithms we have seen in chapters 13, 14 and 15 and give their key
properties.
(b) Usually, we have dealt with only 2 classes in our examples. What changes with respect to the
classification algorithms in (a) do we need to make if we want to classify more than 2 classes?
(c) Explain the difference between classification and clustering.

Solution

(a) Linear: Naive Bayes [probabilistic; independence assumption: each feature is assumed independent
    of the other features], Rocchio [computes class centroids and assigns new documents the class of
    the nearest centroid], SVM [large-margin classifier; uses support vectors to compute a decision
    hyperplane between classes]
    Non-linear: kNN [decision boundary consists of locally linear segments; no real training phase needed]


(b) If we have e.g. 4 classes: Perform a first binary classification for c1 and {c2 , c3 , c4 }. In the next
step classify c2 and {c3 , c4 } etc.

(c) Classification is supervised, i.e. a classifier is trained on a labeled dataset. Clustering, on the
    other hand, is unsupervised: it is carried out on unlabeled data.



Tutorial 7: Information Retrieval
Informatics 1 Data & Analysis

Week 9, Semester 2, 2013–2014

This worksheet has three parts: tutorial Questions, followed by some Examples and their Solutions.

• Before your tutorial, work through and attempt all of the Questions in the first section.

• The Examples are there for additional preparation, practice, and revision.

• Use the Solutions to check your answers, and read about possible alternatives.

You must bring your answers to the main questions along to your tutorial. You will need to be
able to show these to your tutor, and may be exchanging them with other students, so it is best to
have them printed out on paper.
If you cannot do some questions, write down what it is that you find challenging and use this to
ask your tutor in the meeting.
Tutorials will not usually cover the Examples, but if you have any questions about those then write
them down and ask your tutor, or go along to InfBASE during the week.
It’s important both for your learning and other students in the group that you come to tutorials
properly prepared. If you have not attempted the main tutorial questions, then you may be sent
away from the tutorial to do them elsewhere.
Some exercise sheets contain material marked with a star (⋆). These are optional extensions.
Data & Analysis tutorials are not formally assessed, but they are a compulsory and important part
of the course. If you do not do the exercises then you are unlikely to pass the exam.
Attendance at tutorials is obligatory: if you are ill or otherwise unable to attend one week then
email your tutor, and if possible attend another tutorial group in the same week.
Please send any corrections and suggestions to Ian.Stark@ed.ac.uk

Introduction
This tutorial is about Information Retrieval (IR). It deals with two aspects of the information retrieval
task discussed in lectures: evaluating performance of IR systems, and methods for document ranking.
Note that these exercises are running concurrently with the Inf1-DA assignment. If you have questions
or difficulties with that, please ask your tutor about them during the tutorial.

Question 1: Evaluating an Information Retrieval System


Consider the following hypothetical information retrieval scenario. Suppose it has been found at
Edinburgh Royal Infirmary that due to equipment malfunction, the results of blood tests taken on
2013-12-04 are unreliable for diabetic patients. The hospital would like to contact all diabetic patients
who had any kind of blood test on that day, to repeat the test. The hospital uses an information
retrieval system to identify these patients. Suppose the collection of patients’ medical records contains
10000 documents, 150 of which are relevant to the above query. The system returns 250 documents,
125 of which are relevant to the query.

(a) Calculate the precision and recall for this system, showing the details of your calculations.

(b) Based on your results from (a), explain what the two measures mean for this scenario. How well
would you say that the hospital's IR system works?

(c) According to the precision-recall tradeoff, what will likely happen if an IR system is tuned to
aim for 100% recall?

(d) For the given scenario, which measure do you think is more important, precision or recall? Why?
Given your answer, what value would you give to the weighting factor α when calculating the
F-score measure for the hospital’s IR system?

⋆ (e) Last semester, in Informatics 1: Computation and Logic, you encountered the properties of
soundness and completeness for a logic. Can you relate them to precision and recall of an
IR system?

Question 2: Ranking Documents


You are looking for information on the Economic Growth in Scotland in a large document col-
lection. You decide to search using the terms: economy, growth, Scotland, banks and business
using an information retrieval system and this recommends three possible documents. You are given
the frequency of each of the terms in each document, shown in the table below:
Terms economy Scotland growth banks business
Document 1 10 8 0 2 1
Document 2 0 0 9 9 8
Document 3 2 2 4 4 6
Query 1 1 1 1 1
You have no additional information about the documents; and to actually retrieve any one document
will cost money.

(a) One possible measure for deciding between the 3 documents is the cosine similarity measure,
which measures the cosine of the angle between the query vector and that of each document.
Compute this measure for each of the three documents.

(b) Based on your results of (a), which document is the best match for this query? Why?

(c) Do you agree with the results of this analysis? What are the strengths and weaknesses of cosine
measure?

Examples
This section contains further exercises on information retrieval. All are based on parts of past exam
papers. Following these there is a section presenting solutions and notes on all the examples.

Example 1
(a) What is the information retrieval task ? Give an example of such a task, indicating how it
matches your description.

(b) The performance of an information retrieval system can be evaluated in terms of its precision, P ,
and recall, R. Give an English-language definition of these two terms.

(c) Precision and recall are computed as follows:


TP TP
P = R=
TP + FP TP + FN
Name and define the three values TP , FP , FN appearing here.

(d) Two retrieval systems, X and Y, are being compared. Both are given the same query, applied
to a collection of 1500 documents. System X returns 400 documents, of which 40 are relevant to
the query. System Y returns 30 documents, of which 15 are relevant to the query. Within the
whole collection there are in fact 50 documents relevant to the query.
Tabulate the results for each system, and compute the precision and recall for both X and Y.
Show your working.

(e) Both precision and recall need to be taken into account when evaluating retrieval systems. Why
is it not sufficient to pick one and use only that?

(f ) The F -score is a measure which combines both measures using a weighting factor α, where high α
means that precision is more important. Give the formula defining the F -score for weighting α.

(g) How is F -score related to the harmonic mean?

(h) For the example task you gave in part (a), suggest an appropriate weighting factor α. Justify
your choice.

Example 2
Suppose you wish to find economic reports regarding the impact of oil extraction in the North Sea on
the Scottish economy. A commercial document retrieval service offers the following suggested matches:
the table shows how often some key phrases appear in each report.
North Sea oil Scotland economy
Report A 12 0 3 24
Report B 10 5 20 10
Report C 0 12 9 8
Query 1 1 1 1
Actually obtaining the reports will cost real money, so you would like to select the one most likely to
be relevant. Your task now is to assess this using the cosine similarity measure.

(a) Write out the general formula for calculating the cosine of the angle between two 4-dimensional
vectors (x1 , x2 , x3 , x4 ) and (y1 , y2 , y3 , y4 ).

(b) Use this formula to rank the three documents in order of relevance to the query according to
the cosine similarity measure. What do you think of the results?

Solutions to Examples
These are not entirely “model” answers; instead, they indicate a possible solution. Remember that
not all of the questions necessarily have a single “right” answer. If you have difficulties with a particular
example, or have trouble following through the solution, please raise this as a question in your tutorial.

Solution 1
(a) The information retrieval task is to find those documents relevant to a user query from among
some large collection of documents.
For example, searching for previous legal rulings relevant to a certain topic from a judicial
archive. The judicial archive is the document collection; the query is some words related to the
topic; and the previous rulings are the relevant documents to be retrieved.
Other examples are possible, of course; but you would still need to identify the document col-
lection, the query, and which documents are relevant.

(b) Precision records what proportion of the documents retrieved do in fact match the query; recall
is the proportion of relevant documents in the collection which are successfully retrieved.
This kind of question is often referred to as “bookwork” — however, even though the required
information can indeed be found in books, it’s still important to be able to explain it clearly in
any given context.

(c) Here are the full names and definitions for the three terms.

• TP is True Positives, the number of relevant documents correctly returned.


• FP is False Positives, the number of irrelevant documents returned.
• FN is False Negatives, the number of relevant documents incorrectly rejected.

Note that this question asks you to both “name” and “define” the values, so it wouldn’t be
enough to say just “True Positives”: you need the definition as well.

(d) “Tabulate” means to exhibit in a table, so this question requires a table showing the results for
each system.

X Relevant Not relevant Total


Retrieved 40 360 400
Not retrieved 10 1090 1100
Total 50 1450 1500

Y Relevant Not relevant Total


Retrieved 15 15 30
Not retrieved 35 1435 1470
Total 50 1450 1500

System X precision P = 40/400 = 0.1        System Y precision P = 15/30 = 0.5
System X recall    R = 40/50  = 0.8        System Y recall    R = 15/50 = 0.3
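The figures above can be checked with a short sketch (in Python for brevity; the counts come straight from the tables):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positives and false negatives."""
    return tp / (tp + fp), tp / (tp + fn)

# System X: 40 relevant among 400 returned; 10 relevant documents missed.
px, rx = precision_recall(tp=40, fp=360, fn=10)
# System Y: 15 relevant among 30 returned; 35 relevant documents missed.
py, ry = precision_recall(tp=15, fp=15, fn=35)
# (px, rx, py, ry) ≈ (0.1, 0.8, 0.5, 0.3)
```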

(e) Depending on just one out of precision and recall can lead to extreme but unhelpful solutions.
A system that returns every document indiscriminately has 100% recall; while one that returns
only a single correct document is 100% precise. As information retrieval systems, the first is no
help at all, and the second is not much better.

(f ) Here is the formula for F -score in terms of α.
Fα = 1 / ( α(1/P) + (1 − α)(1/R) )

(g) For α = 0.5 the F0.5-score, or balanced F-score, is the harmonic mean of precision and recall:

F0.5 = 1 / ( (1/2)(1/P) + (1/2)(1/R) ) = 2PR / (P + R)
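A small sketch of the formula, checking that α = 0.5 gives the harmonic mean (the System X figures from part (d) are reused here):

```python
def f_score(p, r, alpha=0.5):
    """F-score with weighting factor alpha (alpha > 0.5 favours precision)."""
    return 1.0 / (alpha / p + (1 - alpha) / r)

p, r = 0.1, 0.8                 # System X from part (d)
balanced = f_score(p, r)
# balanced equals the harmonic mean 2*p*r/(p + r)
```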

(h) For the retrieval of legal judgements, recall is of particular importance (you really don’t want to
miss anything), so a value of α below 0.5, say 0.2, might be appropriate.
For other examples, either recall or precision might be more important, depending on the exact
choice of example.

Solution 2
(a) The cosine formula for 4-vectors is:
cos(~x, ~y) = (x1y1 + x2y2 + x3y3 + x4y4) / ( √(x1² + x2² + x3² + x4²) · √(y1² + y2² + y3² + y4²) )

It’s also possible to give a more compact presentation using vector notation:

~x.~y
cos(~x, ~y ) =
|~x||~y |

although that’s only useful if you are confident in how to then calculate the dot product and
modulus of 4-dimensional vectors.

(b) For the three reports listed, the appropriate calculation is the cosine between each report and
the original query.
cos(Report A, Query) = (12 + 3 + 24) / ( √4 · √(12² + 3² + 24²) ) = 39/54 ≈ 0.72

cos(Report B, Query) = (10 + 5 + 20 + 10) / ( √4 · √(10² + 5² + 20² + 10²) ) = 45/50 = 0.90

cos(Report C, Query) = (12 + 9 + 8) / ( √4 · √(12² + 9² + 8²) ) = 29/34 ≈ 0.85
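The same three cosines can be computed with a short Python sketch (vector order: North Sea, oil, Scotland, economy):

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

query = [1, 1, 1, 1]
reports = {"A": [12, 0, 3, 24], "B": [10, 5, 20, 10], "C": [0, 12, 9, 8]}
scores = {name: cosine(vec, query) for name, vec in reports.items()}
# scores ≈ {"A": 0.72, "B": 0.90, "C": 0.85}; Report B ranks first
```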

The best fit is where the cosine is largest, closest to 1. This ranks the three documents in order
of similarity to the query as:

• Report B
• Report C
• Report A

These results seem reasonable: Report B is the only document which contains all the keywords;
while Report C does mention oil it doesn’t mention the North Sea specifically; and Report A
doesn’t mention oil at all. The superiority of C over A seems clear in the cosine measure, but I
don’t think it is altogether obvious from simply inspecting the numbers.
Notice that it’s not necessary to take the inverse cosine and compute the actual angle between
the vectors. The question doesn’t ask for this. However, if you did, then the best match would
be the smallest angle, closest to 0.

Homework 2
Page 110: Exercise 6.10; Exercise 6.12 
Page 116: Exercise 6.15; Exercise 6.17 
Page 121: Exercise 6.19 
Page 122: Exercise 6.20; Exercise 6.23; Exercise 6.24 
Page 131: Exercise 7.3; Exercise 7.5; Exercise 7.8 
Page 144: Exercise 8.1 
Page 145: Exercise 8.3 
Page 150: Exercise 8.9 
Page 154: Exercise 8.10 
Page 167: Exercise 9.3 
Page 177: Exercise 9.7 
Page 211: Exercise 11.3 
Page 228: Exercise 12.6; Exercise 12.7

Exercise 6.10
Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in
Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, best, for
each document, using the idf values from Figure 6.8.

 
Solution

             Doc1    Doc2    Doc3
car          44.55    6.6    39.6
auto          6.24   68.64    0
insurance     0      53.46   46.98
best         21       0      25.5
 
Exercise 6.12
How does the base of the logarithm in (6.7) affect the score calculation in (6.9)? How
does the base of the logarithm affect the relative scores of two documents on a given
query?

Solution

For any base b > 1, idf_b(t) = log_b(N/df_t) = log(N/df_t) / log b, where 1/log b is a constant.

So changing the base scales every score by the constant factor 1/log b, and the relative scores of
two documents on a given query are not affected.

Exercise 6.15
Recall the tf-idf weights computed in Exercise 6.10. Compute the Euclidean
normalized document vectors for each of the documents, where each vector has four
components, one for each of the four terms.

Solution
doc1 = [0.8974, 0.1257, 0, 0.4230]
doc2 = [0.0756, 0.7867, 0.6127, 0]
doc3 = [0.5953, 0, 0.7062, 0.3833]

Exercise 6.17
With term weights as computed in Exercise 6.15, rank the three documents by
computed score for the query car insurance, for each of the following cases of term
weighting in the query:
1. The weight of a term is 1 if present in the query, 0 otherwise.
2. Euclidean normalized idf.

Solution
1. q = [1, 0, 1, 0]
score(q, doc1)= 0.8974, score(q, doc2) = 0.6883, score(q, doc3) = 1.3015
Ranking: doc3, doc1, doc2
2. q = [0.4778, 0.6024, 0.4692, 0.4344]
score(q, doc1) = 0.6883, score(q, doc2) = 0.7975, score(q, doc3) = 0.7823
Ranking: doc2, doc3, doc1

Exercise 6.19
Compute the vector space similarity between the query “digital cameras” and the
document “digital cameras and video cameras” by filling out the empty columns in
Table 6.1. Assume N = 10,000,000, logarithmic term weighting (wf columns) for
query and document, idf weighting for the query only and cosine normalization for
the document only. Treat “and” as a stop word. Enter term counts in the tf columns.
What is the final similarity score?
Solution
            Query                                      Document                    Product
Word        tf   wf   df        idf   qi = wf·idf     tf   wf    di (normalized)   qi·di
digital     1    1    10,000    3     3               1    1     0.52              1.56
video       0    0    100,000   2     0               1    1     0.52              0
cameras     1    1    50,000    2.3   2.3             2    1.3   0.68              1.56

Similarity score = 1.56 + 1.56 = 3.12
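A sketch of the same computation (logarithmic wf for query and document, idf on the query side only, cosine normalization on the document side only):

```python
import math

N = 10_000_000
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}
q_tf = {"digital": 1, "video": 0, "cameras": 1}   # query "digital cameras"
d_tf = {"digital": 1, "video": 1, "cameras": 2}   # document, "and" dropped as stop word

def wf(tf):
    """Logarithmic term weighting."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

q_w = {t: wf(q_tf[t]) * math.log10(N / df[t]) for t in df}   # query: wf * idf
d_w = {t: wf(d_tf[t]) for t in df}                           # document: wf only
norm = math.sqrt(sum(w * w for w in d_w.values()))           # cosine normalization
sim = sum(q_w[t] * d_w[t] / norm for t in df)
# sim ≈ 3.12
```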

Exercise 6.20
Show that for the query affection, the relative ordering of the scores of the three
documents in Figure 6.13 is the reverse of the ordering of the scores for the query
jealous gossip.

Solution:
For the query affection, score(q, SaS) = 0.996, score(q, PaP) = 0.993, score(q, WH) =
0.847, so the order is SaS, PaP, WH.
For the query jealous gossip, score(q, SaS) = 0.104, score(q, PaP) = 0.12, score(q, WH) = 0.72,
so the order is WH, PaP, SaS.
So the latter ordering is the reverse of the former.

Exercise 6.23
Refer to the tf and idf values for four terms and three documents in Exercise 6.10.
Compute the two top scoring documents on the query best car insurance for each of
the following weighting schemes: (i) nnn.atc; (ii) ntc.atc.

Solution
(i) nnn.atc
nnn weights for documents
Term        Doc1   Doc2   Doc3
Car          27      4     24
Auto          3     33      0
Insurance     0     33     29
Best         14      0     17

            Query                                      Product
Term        tf (augmented)   idf    tf-idf   atc weight    Doc1     Doc2     Doc3
Car         1                1.65   1.65     0.56         15.12     2.24    13.44
Auto        0.5              2.08   1.04     0.353         1.06    11.65     0
Insurance   1                1.62   1.62     0.55          0       18.15    15.95
Best        1                1.5    1.5      0.51          7.14     0        8.67
Score(q, doc1) = 15.12 + 1.06 +0 + 7.14 = 23.32, score(q, doc2) = 2.24 + 11.65 +
18.15 + 0 = 32.04, score(q, doc3) = 13.44 + 0 + 15.95 + 8.67 = 38.06
Ranking: doc3, doc2, doc1

(ii) ntc.atc
ntc weights for doc1
Term        tf    idf    tf-idf   normalized weight
Car         27    1.65   44.55    0.897
Auto         3    2.08    6.24    0.125
Insurance    0    1.62    0       0
Best        14    1.5    21       0.423

ntc weights for doc2
Term        tf    idf    tf-idf   normalized weight
Car          4    1.65    6.6     0.075
Auto        33    2.08   68.64    0.786
Insurance   33    1.62   53.46    0.613
Best         0    1.5     0       0

ntc weights for doc3
Term        tf    idf    tf-idf   normalized weight
Car         24    1.65   39.6     0.595
Auto         0    2.08    0       0
Insurance   29    1.62   46.98    0.706
Best        17    1.5    25.5     0.383

            Query                                      Product
Term        tf (augmented)   idf    tf-idf   atc weight    Doc1     Doc2     Doc3
Car         1                1.65   1.65     0.56          0.502    0.042    0.33
Auto        0.5              2.08   1.04     0.353         0.044    0.277    0
Insurance   1                1.62   1.62     0.55          0        0.337    0.38
Best        1                1.5    1.5      0.51          0.216    0        0.19
Score(q, doc1) = 0.762, score(q, doc2) = 0.657, score(q, doc3) = 0.916
Ranking: doc3, doc1, doc2

Exercise 6.24
Suppose that the word coyote does not occur in the collection used in Exercises 6.10
and 6.23. How would one compute ntc.atc scores for the query coyote insurance?

Solution
For the ntc (document) weights, only the weight of insurance needs to be computed.
For the atc (query) weight of coyote there is nothing to compute: since coyote does not
occur in the collection, its ntc weight is 0 in every document, so it cannot contribute to any score.

Exercise 7.3
If we were to only have one-term queries, explain why the use of global champion
lists with r = K suffices for identifying the K highest scoring documents. What is a
simple modification to this idea if we were to only have s-term queries for any fixed
integer s > 1?
Solution
1. We take the union of the champion lists for each of the terms comprising the
query, and restrict cosine computation to only the documents in the union set. If
the query contains only one term, we just take the list with r = K, because there is
no need to compute the union.
2. For an s-term query, take the union of the s champion lists and identify the K highest scoring documents within the union.

Exercise 7.5
Consider again the data of Exercise 6.23 with nnn.atc for the query-dependent
scoring. Suppose that we were given static quality scores of 1 for Doc1 and 2 for
Doc2. Determine under Equation (7.2) what ranges of static quality score for Doc3
result in it being the first, second or third result for the query best car insurance.

Solution
Suppose the static quality score for Doc3 is g(doc3).
According to Exercise 6.23 and Equation 7.2, score(doc1, q) = 0.7627+ 1 = 1.7627,
score(doc2, q) = 0.6839 + 2 = 2.6839, score(doc3, q) = 0.9211 + g(doc3).
For Doc3 to be:
(1) the first: 0.9211 + g(doc3) > 2.6839, we get g(doc3) > 1.7628
(2) the second: 1.7627< 0.9211 + g(doc3) < 2.6839, we get 0.8416 < g(doc3) <
1.7628
(3) the third: 0.9211+ g(doc3) <1.7627, we get 0 <=g(doc3) < 0.8416.

Exercise 7.8
The nearest-neighbor problem in the plane is the following: given a set of N data
points on the plane, we preprocess them into some data structure such that, given a
query point Q, we seek the point in N that is closest to Q in Euclidean distance.
Clearly cluster pruning can be used as an approach to the nearest-neighbor problem in
the plane, if we wished to avoid computing the distance from Q to every one of the
data points. Devise a simple example on the plane so that with two leaders, the
answer returned by cluster pruning is incorrect (it is not the data point closest to Q).

Solution

[Figure: two clusters of points with their leaders; the query point lies slightly nearer to the right leader, while the closest data point belongs to the left leader’s cluster.]

As the picture shows, the right leader is closer to the query point than the left
leader, but the closest data point belongs to the left group, so cluster pruning returns the wrong answer.

Exercise 8.1
An IR system returns 8 relevant documents, and 10 nonrelevant documents. There
are a total of 20 relevant documents in the collection. What is the precision of the
system on this search, and what is its recall?

Solution
Precision = 8/18 = 0.44
Recall = 8/20 = 0.4

Exercise 8.3
Derive the equivalence between the two formulas for the F measure shown in Equation
(8.5), given that α = 1/(β² + 1).

Solution
With α = 1/(β² + 1) we have 1 − α = β²/(β² + 1), so

F = 1 / ( α/P + (1 − α)/R ) = (β² + 1)PR / (β²P + R).

Exercise 8.9
The following list of Rs and Ns represents relevant (R) and nonrelevant (N) returned
documents in a ranked list of 20 documents retrieved in response to a query from a
collection of 10,000 documents. The top of the ranked list (the document the system
thinks is most likely to be relevant) is on the left of the list. This list shows 6 relevant
documents. Assume that there are 8 relevant documents in total in the collection.
R R N N N N N N R N R N N N R N N N N R
a. What is the precision of the system on the top 20?
b. What is the F1 on the top 20?
c. What is the uninterpolated precision of the system at 25% recall?
d. What is the interpolated precision at 33% recall?
e. Assume that these 20 documents are the complete result set of the system. What
is the MAP for the query?
Assume, now, instead, that the system returned the entire 10,000 documents in a
ranked list, and these are the first 20 results returned.
f. What is the largest possible MAP that this system could have?
g. What is the smallest possible MAP that this system could have?
h. In a set of experiments, only the top 20 results are evaluated by hand. The result
in (e) is used to approximate the range (f)–(g). For this example, how large (in
absolute terms) can the error for the MAP be by calculating (e) instead of (f) and
(g) for this query?

Solution
a. Precision = 6/20 = 0.3
b. Recall = 6/8 = 0.75, so F1 = (2 · 0.3 · 0.75)/(0.3 + 0.75) = 3/7 ≈ 0.43
c. 25% recall corresponds to 8 · 0.25 = 2 relevant documents retrieved. Recall stays at
25% from rank 2 (where the second relevant document appears) up to rank 8, so the
uninterpolated precision at 25% recall could be 1, 2/3, 2/4, 2/5, 2/6, 2/7 or 2/8 = 1/4.
d. Because the highest precision found for any recall level larger than 33% is 4/11 ≈ 0.364,
the interpolated precision at 33% recall is 4/11 ≈ 0.364.
e. MAP = 1/6 · (1 + 1 + 3/9 + 4/11 + 5/15 + 6/20) ≈ 0.555
f. MAP is largest if the two missing relevant documents appear at ranks 21 and 22:
MAP = 1/8 · (1 + 1 + 3/9 + 4/11 + 5/15 + 6/20 + 7/21 + 8/22) ≈ 0.503
g. MAP is smallest if they appear at ranks 9999 and 10000:
MAP = 1/8 · (1 + 1 + 3/9 + 4/11 + 5/15 + 6/20 + 7/9999 + 8/10000) ≈ 0.417
h. 0.555 − 0.417 = 0.138 and 0.555 − 0.503 = 0.052,
so the error is in [0.052, 0.138].
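The MAP arithmetic for parts (e)–(g) can be reproduced with a short sketch (as in the solution, part (e) averages over the 6 retrieved relevant documents only):

```python
ranking = "R R N N N N N N R N R N N N R N N N N R".split()
total_relevant = 8

hits, precisions = 0, []
for rank, label in enumerate(ranking, start=1):
    if label == "R":
        hits += 1
        precisions.append(hits / rank)      # precision at each relevant document

map_top20 = sum(precisions) / len(precisions)                    # part (e)
# Part (f): the two unseen relevant docs at ranks 21 and 22 (best case);
# part (g): at ranks 9999 and 10000 (worst case).
map_best = (sum(precisions) + 7/21 + 8/22) / total_relevant
map_worst = (sum(precisions) + 7/9999 + 8/10000) / total_relevant
# map_top20 ≈ 0.555, map_best ≈ 0.503, map_worst ≈ 0.417
```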
 
Exercise 8.10
Below is a table showing how two human judges rated the relevance of a set of 12
documents to a particular information need (0 = nonrelevant, 1 = relevant). Let us
assume that you’ve written an IR system that for this query returns the set of
documents
{4, 5, 6, 7, 8}.
docID   Judge 1   Judge 2
  1        0         0
  2        0         0
  3        1         1
  4        1         1
  5        1         0
  6        1         0
  7        1         0
  8        1         0
  9        0         1
 10        0         1
 11        0         1
 12        0         1
a. Calculate the kappa measure between the two judges.
b. Calculate precision, recall, and F1 of your system if a document is considered
relevant only if the two judges agree.
c. Calculate precision, recall, and F1 of your system if a document is considered
relevant if either judge thinks it is relevant.
 
Solution
(a)
P(A) = 4/12 = 1/3
P(nonrelevant) = (6+6)/(12+12) = 0.5, P(relevant) = (6+6)/(12+12) = 0.5
P(E)=0.5*0.5 + 0.5*0.5 = 0.5
Kappa = (P(A)-P(E))/(1-P(E)) = -1/3
(b)
Precision = 1/5 = 0.2
Recall = 1/2 = 0.5
F1 = 2*0.2*0.5/(0.2+0.5) = 2/7 = 0.286
(c)
Precision = 5/5 = 1
Recall = 5/10 = 0.5
F1 = 2*1*0.5/(1+0.5) = 2/3 = 0.667
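The kappa computation in part (a) can be sketched as follows (judgements taken from the table above):

```python
judge1 = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
judge2 = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
n = len(judge1)

p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n    # P(A) = 1/3
p_rel = (sum(judge1) + sum(judge2)) / (2 * n)                # pooled P(relevant) = 0.5
p_chance = p_rel ** 2 + (1 - p_rel) ** 2                     # P(E) = 0.5
kappa = (p_agree - p_chance) / (1 - p_chance)
# kappa == -1/3: agreement is worse than chance
```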
 
Exercise 9.3 in the book and in the PDF edition are different.
Exercise 9.3 (pdf edition)
Under what conditions would the modified query qm in Equation 9.3 be the same as 
the original query q0? In all other cases, is qm closer than q0 to the centroid of the 
relevant documents? 
 
Solution
The Rocchio update is

q_m = α·q_0 + β·(1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ·(1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

One possible condition for q_m = q_0 is:

α = 1 and β·(1/|D_r|) Σ_{d_j ∈ D_r} d_j = γ·(1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

(for instance β = γ = 0), so that the two feedback terms cancel.

No: if β is very small and γ is very large, q_0 might be closer to the centroid of
the relevant documents than q_m.
 
Exercise 9.3 (book)
Suppose that a user’s initial query is cheap CDs cheap DVDs extremely cheap CDs.
The user examines two documents, d1 and d2. She judges d1, with the content CDs
cheap software cheap CDs relevant and d2 with content cheap thrills DVDs
nonrelevant. Assume that we are using direct term frequency (with no scaling and no
document frequency). There is no need to length-normalize vectors. Using Rocchio
relevance feedback as in Equation (9.3) what would the revised query vector be after
relevance feedback? Assume alpha= 1, beta= 0.75, gamma= 0.25.
 
Solution 
  cheap  CDs  DVDs  extremely software  thrills 
q0  3  2  1  1  0  0 
d1  2  2  0  0  1  0 
d2  1  0  1  0  0  1 
Using Rocchio algorithm, qm=q0 + 0.75*d1‐0.25*d2. Negative weights are set to 0. 
qm  4.25  3.5  0.75  1  0.75  0 
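The update in the table can be sketched as follows (vector order as in the table; negative components are clipped to 0):

```python
# Vocabulary order: cheap, CDs, DVDs, extremely, software, thrills
q0 = [3, 2, 1, 1, 0, 0]
d1 = [2, 2, 0, 0, 1, 0]          # judged relevant
d2 = [1, 0, 1, 0, 0, 1]          # judged nonrelevant
alpha, beta, gamma = 1.0, 0.75, 0.25

qm = [max(0.0, alpha * q + beta * dr - gamma * dn)   # Rocchio, negatives set to 0
      for q, dr, dn in zip(q0, d1, d2)]
# qm == [4.25, 3.5, 0.75, 1.0, 0.75, 0.0]
```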
 
 
Exercise 9.7
If A is simply a Boolean ”incidence” matrix, then what do you get as the entries in
C?

Solution
C_{u,v} = Σ_d A_{u,d} A_{v,d} is the number of documents containing both term u and term v.
With a Boolean incidence matrix, A_{t,d} = 1 means that term t appears in document d.
Since C is the similarity score, C_{u,v} is increased by one for every document in which
both terms appear: if A_{u,d} = 1 and A_{v,d} = 1, document d contributes 1 to C_{u,v}.

Exercise 11.3
Let Xt be a random variable indicating whether the term t appears in a document.
Suppose we have |R| relevant documents in the document collection and that Xt = 1
in s of the documents. Take the observed data to be just these observations of Xt for
each document in R. Show that the MLE for the parameter pt = P(Xt = 1|R = 1, q),
that is, the value for pt which maximizes the probability of the observed data, is
pt = s/|R|.

Solution
The likelihood of the observed data is

P(D | p_t) = ∏_{d ∈ R} p_t^{X_t(d)} (1 − p_t)^{1 − X_t(d)} = p_t^s (1 − p_t)^{|R| − s},

where s is the number of relevant documents containing the term and |R| is the total
number of relevant documents.
The log-likelihood is log P(D | p_t) = s log p_t + (|R| − s) log(1 − p_t).
Setting its derivative to zero, s/p_t − (|R| − s)/(1 − p_t) = 0, we get p_t = s/|R|.

Exercise 12.6
Consider making a language model from the following training text:
The martian has landed on the latin pop sensation ricky martin
a. Under a MLE-estimated unigram probability model, what are P(the) and
P(martian)?
b. Under a MLE-estimated bigram model, what are P(sensation|pop) and P(pop|the)?

Solution
a. P(the) = 2/11, P(martian) = 1/11
b. P(sensation|pop) = 1, P(pop|the) = 0
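Both MLE estimates can be sketched with simple counting:

```python
from collections import Counter

tokens = "the martian has landed on the latin pop sensation ricky martin".split()
unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))

p_the = unigram["the"] / len(tokens)                                  # 2/11
p_martian = unigram["martian"] / len(tokens)                          # 1/11
p_sensation_given_pop = bigram[("pop", "sensation")] / unigram["pop"]  # 1.0
p_pop_given_the = bigram[("the", "pop")] / unigram["the"]              # 0.0
```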

Exercise 12.7
Suppose we have a collection that consists of the 4 documents given in the below
table.
docID Document text
1 click go the shears boys click click click
2 click click
3 metal here
4 metal shears click here
Build a query likelihood language model for this document collection. Assume a
mixture model between the documents and the collection, with both weighted at 0.5.
Maximum likelihood estimation (mle) is used to estimate both as unigram models.
Work out the model probabilities of the queries click, shears, and hence click shears
for
each document, and use those probabilities to rank the documents returned by each
query. Fill in these probabilities in the below table:
Query Doc 1 Doc 2 Doc 3 Doc 4
click
shears
click shears
What is the final ranking of the documents for the query click shears?

Solution
Language models
              click   go     the    shears   boys   metal   here
model1         1/2    1/8    1/8     1/8     1/8     0       0
model2          1      0      0       0       0      0       0
model3          0      0      0       0       0     1/2     1/2
model4         1/4     0      0      1/4      0     1/4     1/4
collection     7/16   1/16   1/16    2/16    1/16   2/16    2/16

Model probabilities of the queries

query          doc1     doc2     doc3     doc4
click          0.4688   0.7188   0.2188   0.3438
shears         0.125    0.0625   0.0625   0.1875
click shears   0.0586   0.0449   0.0137   0.0645

Final ranking for the query click shears: doc4 > doc1 > doc2 > doc3
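The mixture model and the table above can be reproduced with a short sketch:

```python
from collections import Counter

docs = {
    1: "click go the shears boys click click click".split(),
    2: "click click".split(),
    3: "metal here".split(),
    4: "metal shears click here".split(),
}
collection = [t for d in docs.values() for t in d]
coll_counts, coll_len = Counter(collection), len(collection)

def p_query(query, doc, lam=0.5):
    """Query likelihood under a 0.5/0.5 document-collection mixture of unigram MLEs."""
    counts, n = Counter(doc), len(doc)
    p = 1.0
    for t in query.split():
        p *= lam * counts[t] / n + (1 - lam) * coll_counts[t] / coll_len
    return p

scores = {d: p_query("click shears", doc) for d, doc in docs.items()}
# Ranking: doc4 > doc1 > doc2 > doc3
```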
Web Mining

Exercises
Mauro Brunato, Elisa Cilia

May 18, 2011

Exercise 1
A corpus contains the following five documents:
d1 To be or not to be, this is the question!
d2 I have a pair of problems for you to solve today.
d3 It’s a long way to Tipperary, it’s a long way to go. . .
d4 I’ve been walking a long way to be here with you today.
d5 I am not able to question these orders.
The indexing system only considers nouns, adjectives, pronouns, adverbs and verbs. All forms are converted to
the singular, verbs are converted to the infinitive, all punctuation marks are removed and all letters are translated to
uppercase. Conjunctions, prepositions, articles and exclamations are discarded as well. Multiple occurrences of
the same term within a document are not counted.
For instance, the phrase
Hey, it’s not too late to solve these exercises!
becomes
IT BE NOT TOO LATE SOLVE THIS EXERCISE
1.1) What is the minimum dimension (number of coordinates) of the TFIDF vector space for this collection of
documents?
1.2) Fill the 5 × 5 matrix of Jaccard coefficients between all pairs of documents.
1.3) Apply an agglomerative clustering procedure to the collection. As a measure of similarity between two
clusters D1 and D2, consider the highest similarity between d1 and d2, with d1 ∈ D1 and d2 ∈ D2.
1.4) Draw the resulting dendrogram.

Solution — The stripped-down documents are the following (the third column counts the number of different terms
in each document, to ease the calculation of the Jaccard coefficients):

d1 BE NOT THIS QUESTION 4


d2 I HAVE PAIR PROBLEM YOU SOLVE TODAY 7
d3 IT BE LONG WAY TIPPERARY GO 6
d4 I HAVE BE WALK LONG WAY HERE YOU TODAY 9
d5 I BE NOT ABLE QUESTION THIS ORDER 7

1.1) The collection includes 20 different terms: ABLE, BE, GO, HAVE, I, IT, HERE, LONG, NOT, ORDER, PAIR,
PROBLEM, QUESTION, SOLVE, THIS, TIPPERARY, TODAY, WALK, WAY, and YOU. Therefore, the vector represen-
tation requires at least 20 dimensions.
1.2) The table of Jaccard coefficients is the following. Only the upper triangular part is shown, since the Jaccard
coefficient is symmetrical.

d1 d2 d3 d4 d5
d1 1 0 1/9 1/12 4/7
d2 1 0 1/3 1/13
d3 1 1/4 1/12
d4 1 1/7
d5 1
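The coefficients in the table come from set operations on the stripped-down documents; a sketch:

```python
docs = {
    "d1": {"BE", "NOT", "THIS", "QUESTION"},
    "d2": {"I", "HAVE", "PAIR", "PROBLEM", "YOU", "SOLVE", "TODAY"},
    "d3": {"IT", "BE", "LONG", "WAY", "TIPPERARY", "GO"},
    "d4": {"I", "HAVE", "BE", "WALK", "LONG", "WAY", "HERE", "YOU", "TODAY"},
    "d5": {"I", "BE", "NOT", "ABLE", "QUESTION", "THIS", "ORDER"},
}

def jaccard(a, b):
    """Jaccard coefficient: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# e.g. jaccard(docs["d1"], docs["d5"]) gives the table entry 4/7
```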

1.3) The two most similar documents are d1 and d5 , so they can be joined in the same partition. The similarity matrix
becomes:
{d1 , d5 } {d2 } {d3 } {d4 }
{d1 , d5 } 1 1/13 1/9 1/7
{d2 } 1 0 1/3
{d3 } 1 1/4
{d4 } 1

After this step, singletons {d2 } and {d4 } are most similar, and shall be joined:

{d1 , d5 } {d2 , d4 } {d3 }


{d1 , d5 } 1 1/7 1/9
{d2 , d4 } 1 1/4
{d3 } 1

Next, singleton d3 joins cluster {d2 , d4 }:

{d1 , d5 } {d2 , d3 , d4 }
{d1 , d5 } 1 1/7
{d2 , d3 , d4 } 1

Finally, the two remaining clusters can be merged together. The corresponding dendrogram is the following:

1 5 2 4 3

Exercise 2
In the same setting as in the previous exercise, estimate the Jaccard coefficient for all document pairs based on
the application of five random permutations.
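A minimal MinHash sketch of the estimation procedure (the permutation count and seed are arbitrary choices; with so few permutations the estimate is coarse):

```python
import random

def minhash_signatures(sets, n_perm=5, seed=0):
    """Signatures from random permutations of the vocabulary (MinHash)."""
    vocab = sorted(set().union(*sets.values()))
    rng = random.Random(seed)
    perms = [rng.sample(vocab, len(vocab)) for _ in range(n_perm)]
    # For each permutation, record the first position that hits the set.
    return {name: [next(i for i, t in enumerate(p) if t in s) for p in perms]
            for name, s in sets.items()}

def estimate(sig_a, sig_b):
    # Fraction of permutations whose minimum falls on a shared element.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sets = {"d1": {"BE", "NOT", "THIS", "QUESTION"},
        "d5": {"I", "BE", "NOT", "ABLE", "QUESTION", "THIS", "ORDER"}}
sig = minhash_signatures(sets)
# estimate(sig["d1"], sig["d5"]) approximates the true coefficient 4/7
```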
Exercise 3
Let D be a set of documents over the set T of terms, ntd counts the number of occurrences of term t in document
d.
3.1) Consider the following term frequency measures:

A1(t, d) = n_td,    A2(t, d) = { 1 if n_td ≠ 0; 0 otherwise },    A3(t, d) = n_td / |d|,    A4(t, d) = log(1 + n_td).

Consider each measure according to each of the following criteria separately:


1. The size of a document should not matter (e.g., concatenating two copies of the same document should
not change the measure).
2. The number of occurrences of the term should not matter, only its presence is important.
3. Increasing the number of occurrences of a term should have a lesser impact on the measure if the term is
already frequent.
3.2) Which of the following are suitable IDF functions, and why?
 !−1  !−1
X X
B1 (t) = − log 1 − A1 (t, d) , B2 (d) = 1+ A2 (t, d) ,
d∈D t∈T

!−1
X 1 X
B3 (t) = , B4 (d) = A4 (t, d)
1 + A1 (t, d)
d∈D t∈T

Exercise 4
A document retrieval system must be implemented in a structured programming language (Java, C, C++). Doc-
uments and terms are represented with their numeric IDs.
4.1) Define the appropriate array and record structures to efficiently store the matrix ntd counting the number of
occurrences of each term t in each document d, considering that it is very sparse. Define the structure to store
inverse document frequency values.
4.2) Write a function retrieve(q) which, given the array q of term indices, returns an array with the IDs of
the five nearest documents according to the cosine measure in the TFIDF space.
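A sketch of one possible design, in Python for brevity rather than the Java/C/C++ requested by the exercise (a hash map of postings plays the role of the sparse array-of-records structure; all names are illustrative):

```python
import math
from collections import defaultdict

postings = defaultdict(dict)   # postings[t][d] = n_td, stored sparsely
idf = {}

def index(doc_id, term_ids):
    for t in term_ids:
        postings[t][doc_id] = postings[t].get(doc_id, 0) + 1

def compute_idf(n_docs):
    for t, plist in postings.items():
        idf[t] = math.log(n_docs / len(plist))

def retrieve(q, k=5):
    """IDs of the k documents nearest to query q under cosine in TFIDF space."""
    norms = defaultdict(float)
    for t, plist in postings.items():
        for d, n in plist.items():
            norms[d] += (n * idf[t]) ** 2
    scores = defaultdict(float)
    for t in set(q):
        for d, n in postings.get(t, {}).items():
            scores[d] += (n * idf[t]) * idf[t]      # query tf taken as 1
    return sorted(scores, key=lambda d: scores[d] / math.sqrt(norms[d]),
                  reverse=True)[:k]
```

Only documents sharing at least one term with the query are scored, which is the usual inverted-index shortcut.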

Exercise 5
An information retrieval system manages a corpus of six documents. Given the query q, the system computes
the following probabilities for the documents to be relevant:

i 1 2 3 4 5 6
pi 100% 80% 20% 80% 0 100%

5.1) What strategy can the system adopt in order to maximize its recall score? What strategy can maximize its
precision score?
5.2) Suppose that the only documents that are relevant with respect to query q are 1, 2, 4 and 6 (of course, the
system does not know this). The system implements two alternative algorithms:
1. let document i appear in the returned list iff pi = 100%, or
2. let document i appear in the list with probability pi .
Compute the expected values of precision and recall assigned by the user (who knows the actual document
relevance) to the list of documents returned by each algorithm.
Hint — Note that algorithm (1) is deterministic, only algorithm (2) is stochastic.

Solution —
5.1) Let r = (r_i), where r_i is the “true” relevance of document i (remember that the query is fixed). Let x = (x_i),
where x_i = 1 iff the IR system returns document i in response to the query. Then,

Precision_r(x) = (x · r) / Σ_{i=1..6} x_i,        Recall_r(x) = (x · r) / Σ_{i=1..6} r_i.

In other words, the “precision” of the answer is the amount of relevant documents within the list provided by the IR system.
Its maximum value is attained when all returned documents are relevant, so we need to return only the two documents, 1
and 6, which are certainly relevant to the user. The “recall” of the answer is its property of containing as many relevant
documents as possible, and it is maximized by returning all documents (with the possible exception of 5, which is irrelevant
for sure).
5.2) In the first case, the IR system provides a deterministic answer, having precision 100% and recall 50%. In the
second case, we need to compute precision and recall scores for all possible return strings, and compute their probability-
weighted average:
E(Precision) = Σ_x Pr(x) Precision_r(x),        E(Recall) = Σ_x Pr(x) Recall_r(x).

Note that documents 1 and 6 are always returned, while document 5 is never returned; moreover, documents 2 and 4 are
indistinguishable, so we can determine the following table, where precision (left) and recall (right) scores are provided
together with their probabilities (in parentheses).

                              x2 + x4
                  0 (.04)       1 (.32)       2 (.64)
x3 = 0 (.8)    2/2 | 2/4     3/3 | 3/4     4/4 | 4/4
                 (.032)        (.256)        (.512)
x3 = 1 (.2)    2/3 | 2/4     3/4 | 3/4     4/5 | 4/4
                 (.008)        (.064)        (.128)

Finally,

E(Precision) = .8 + (2/3)·.008 + (3/4)·.064 + (4/5)·.128 ≈ .8 + .005 + .048 + .102 ≈ 96%,

E(Recall) = (2/4)·.04 + (3/4)·.32 + (4/4)·.64 = .02 + .24 + .64 = 90%.
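The expectation can also be obtained by brute-force enumeration of all possible return sets (a sketch of the calculation for algorithm (2)):

```python
from itertools import product

p = [1.0, 0.8, 0.2, 0.8, 0.0, 1.0]     # Pr(document i is returned) under algorithm (2)
relevant = {0, 1, 3, 5}                 # documents 1, 2, 4 and 6 (0-based indices)

e_precision = e_recall = 0.0
for x in product([0, 1], repeat=6):     # every possible returned set
    pr = 1.0
    for xi, pi in zip(x, p):
        pr *= pi if xi else (1 - pi)
    if pr == 0.0:
        continue                        # impossible outcome (docs 1, 6 fixed in, 5 out)
    hits = sum(x[i] for i in relevant)
    e_precision += pr * hits / sum(x)
    e_recall += pr * hits / len(relevant)
# e_precision ≈ 0.96, e_recall = 0.90
```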

Exercise 6
With the same data of Exercise 5, suppose that the system uses algorithm (1).
6.1) Compute the expected precision and recall scores from the point of view of the IR system, who only knows
the probabilities pi for document i to be relevant.

Solution — In this case the IR system’s answer is known, but the actual document relevance is a random variable with
the given probabilities. Therefore, the average values must be computed against probabilities of the unknown r:
X X
Er (Precision) = Pr(r) Precisionr (x), Er (Recall) = Pr(r) Recallr (x).
r r

We know the answer x of the IR system, which is (1, 0, 0, 0, 0, 1), therefore we can compute a table which is similar to that
of Exercise 5:

                              r2 + r4
                  0 (.04)       1 (.32)       2 (.64)
r3 = 0 (.8)    2/2 | 2/2     2/2 | 2/3     2/2 | 2/4
                 (.032)        (.256)        (.512)
r3 = 1 (.2)    2/2 | 2/3     2/2 | 2/4     2/2 | 2/5
                 (.008)        (.064)        (.128)

Therefore, as expected,
Er (Precision) = 100%
because we are sure that only relevant documents are returned. On the other hand,
E_r(Recall) = (2/2)·.032 + (2/3)·.256 + (2/4)·.512 + (2/3)·.008 + (2/4)·.064 + (2/5)·.128
≈ .032 + .171 + .256 + .005 + .032 + .051 ≈ 55%.
Exercise 7
Write in your favorite high-level language a function that implements the FastMap algorithm. In particular,
define what input must be provided and which output shall be returned.

Solution — Let matrix d be the input data (mutual distances between couples of items). The matrix is symmetric,
so many optimizations are possible. Let x be the output matrix with one column per document and one row per extracted
coordinate. We assume that the number of documents n and the number of extracted dimensions m are encoded into matrix
sizes; otherwise, we can pass them as two additional integer parameters.

 1. void FastMap (double d[][], double x[][])
 2. {
 3.     int n = d.length, m = x.length;
 4.     for ( int s = 0; s < m; s++ ) {                    // repeat for the desired number of coordinates
 5.         i, j ← arg max_{i,j} d[i][j];                  // find the two farthest points
 6.         for ( int k = 0; k < n; k++ )                  // compute the s-th coordinate
 7.             x[s][k] ← (d[i][k]² + d[i][j]² − d[k][j]²) / (2 d[i][j]);
 8.         for ( int i1 = 0; i1 < n; i1++ )               // recompute the mutual distances
 9.             for ( int j1 = 0; j1 < n; j1++ )
10.                 d[i1][j1] ← √( d[i1][j1]² − (x[s][i1] − x[s][j1])² );
11.     }
12. }

Note that the term within the square root sign at line 10 might be negative, so a bit of care must be taken when actually
implementing the algorithm. . .

Exercise 8
The columns of the following matrix represent the coordinates of a set of documents in a TFIDF space:
 
          ⎡ 2   0   2 ⎤
A = 1/√6  ⎢ 0   1   0 ⎥
          ⎢ 2   1   2 ⎥
          ⎣ 2  −1   2 ⎦

Let document similarity be defined by the cosine measure (dot product).


8.1) Compute the rank of matrix A.
8.2) Let q = (1, 3, 0, −2)T be a query. Find the document in the set that best satisfies the query.
8.3) Given the matrices  
          ⎡ 1   0 ⎤                  ⎡ 1  0 ⎤
U = α     ⎢ 0   1 ⎥ ,      V = 1/√2  ⎢ 0  1 ⎥
          ⎢ 1   1 ⎥                  ⎣ 1  0 ⎦
          ⎣ 1  −1 ⎦
determine coefficient α and the diagonal matrix Σ so that U is column-orthonormal and A = U ΣV T .
8.4) Project the query q onto the LSI space defined by this decomposition and verify the answer to question 8.2.
Why isn’t the requirement that V be column-orthonormal important in our case?
8.5) Suppose that we want to reduce the LSI space to one dimension. Show how the new approximate document
similarities to q are computed.

Solution —
8.1) Notice that A has two linearly dependent (actually equal) columns (thus rk A < 3), while the first two columns are
independent (thus rk A ≥ 2), therefore rk A = 2.
8.2) Similarities are computed by dot products; let's do it in a single shot for all documents:

    Aᵀ q = (1/√6) ( −2, 5, −2 )ᵀ

The most similar is document 2.



8.3) The column-orthonormality condition for matrix U implies α = 1/√3. By explicitly computing some entries of matrix A = U Σ Vᵀ, we obtain

    Σ = [ 2  0 ]
        [ 0  1 ]
8.4) Projection onto the document LSI space is achieved via Σ⁻¹ Uᵀ:

    q̂ = Σ⁻¹ Uᵀ q = (1/√3) ( −1/2, 5 )ᵀ.

Similarity to the documents is computed via the V Σ2 matrix. If all computations are right,

    V Σ² q̂ = Aᵀ q.
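A quick numerical check of 8.1–8.4 (a sketch using numpy; the matrices are the ones given in the exercise and solution above):

```python
import numpy as np

# Term-document matrix and query from the exercise
A = np.array([[2, 0, 2], [0, 1, 0], [2, 1, 2], [2, -1, 2]]) / np.sqrt(6)
q = np.array([1, 3, 0, -2])

# 8.1-8.2: rank of A and dot-product similarities (document 2 wins)
rank = np.linalg.matrix_rank(A)
sims = A.T @ q

# 8.3: alpha = 1/sqrt(3) and Sigma = diag(2, 1) reproduce A
U = np.array([[1, 0], [0, 1], [1, 1], [1, -1]]) / np.sqrt(3)
V = np.array([[1, 0], [0, 1], [1, 0]]) / np.sqrt(2)
S = np.diag([2.0, 1.0])
assert np.allclose(U @ S @ V.T, A)

# 8.4: project q into the LSI space and verify V Sigma^2 q_hat = A^T q
q_hat = np.linalg.inv(S) @ U.T @ q
assert np.allclose(V @ (S ** 2) @ q_hat, A.T @ q)
```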

Exercise 9
Specify in the MapReduce framework the Map and Reduce functions to find the number of occurrences of
one/more given pattern/s in a collection of documents.

Solution — Let us define the two functions.


    Map:    N × T* −→ (T × N)*
            (offset, line) ↦ [ (match, 1) ]
    Reduce: T × N* −→ (T × N)*
            (match, [n1, ..., nk]) ↦ [ (match, Σi ni) ]

Function Map receives a key (related to the document ID or line offset), which we can disregard, and a sequence of terms
(a line or a full document). It gives as output a list of pairs (match, 1), one for each match of the pattern in the received
value.
Function Reduce takes as input a pair (match, [n1 , . . . , nk ]) where the value part is a list of previously computed
occurrences (originally all 1’s) and returns the list of matching patterns (only one element in this case) with the number of
occurrences for each match.
The pseudo-code for the Map and Reduce functions is the following:

1. map (offset, line)


2. while pattern.matches (line)
3. emit (pattern, 1);

1. reduce (match, values) values is an iterator over counts


2. result = 0;
3. for each v in values
4. result += v;
5. emit (match, result);
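The pair of functions can be exercised with a small in-memory simulation; this Python sketch emulates the shuffle phase by sorting and grouping the intermediate pairs, and, as a simplifying assumption, each pattern is a single term matched exactly.

```python
from itertools import groupby

def map_fn(offset, line, patterns):
    # Emit a (match, 1) pair for every occurrence of a pattern in the line
    return [(p, 1) for w in line.split() for p in patterns if w == p]

def reduce_fn(match, counts):
    # Sum the partial counts collected for one pattern
    return (match, sum(counts))

def run_job(lines, patterns):
    # "Shuffle": group the intermediate pairs by key, then reduce each group
    pairs = sorted(kv for offset, line in enumerate(lines)
                   for kv in map_fn(offset, line, patterns))
    return dict(reduce_fn(key, [v for _, v in group])
                for key, group in groupby(pairs, key=lambda kv: kv[0]))
```

For example, `run_job(["big cat", "small cat big dog"], ["cat", "big"])` yields `{"big": 2, "cat": 2}`.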
Exercise 10
Given the following relevance ranking vector in response to a query q:

    d1, d2, d3, d4, d5, d6

(the underlined documents — here d2, d3, d5 and d6 — are exactly all the relevant ones)


10.1) Determine the interpolated precision at level ρ = 0.5 of recall,
10.2) Determine the “global” F1 − measure (for the system returning all the six documents),
10.3) Determine the Break Even Point (BEP), which is the point of equivalence between (interpolated) precision
and recall.

Solution — 10.1) We are given a ranked list of documents returned in response to a query q with their associated relevance
values. In this ranked retrieval context, precision and recall can be computed by considering as the set of retrieved documents
the top k ranked documents:

k rk R P
0 0 0 1
1 0 0 0
2 1 0.25 0.5
3 1 0.5 0.66
4 0 0.5 0.5
5 1 0.75 0.6
6 1 1 0.66

The interpolated precision Pinterp at a certain level ρ of recall is defined as the highest precision found for any recall level
ρ′ ≥ ρ:

    Pinterp(ρ) = max_{ρ′ ≥ ρ} P(ρ′)

Thus the interpolated precision at level ρ = 0.5 of recall is Pinterp (0.5) = 0.66
10.2) The F1-measure, F1 = (2 × P × R)/(P + R), when all six documents are returned is F1 = 0.8 (taking the exact values P = 4/6 = 2/3 and R = 1 from the last row of the table).
10.3) To find the BEP we plot the interpolated precision curve and intersect it with the precision = recall diagonal:

[Figure: interpolated precision vs. recall; the diagonal meets the curve at BEP = 0.66.]
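The table and the answers can be recomputed with a short sketch (Python; the 0/1 flags encode the ranking above, with d2, d3, d5 and d6 relevant):

```python
def pr_table(rels):
    # (recall, precision) after returning the top k documents, k = 1..len(rels)
    total, hits, table = sum(rels), 0, []
    for k, r in enumerate(rels, start=1):
        hits += r
        table.append((hits / total, hits / k))
    return table

def p_interp(table, rho):
    # Interpolated precision: best precision at any recall level >= rho
    return max(p for r, p in table if r >= rho)

ranking = [0, 1, 1, 0, 1, 1]        # d1..d6, 1 = relevant
table = pr_table(ranking)
```

Here `p_interp(table, 0.5)` gives 2/3 ≈ 0.66, and the last row of the table gives F1 = 2PR/(P + R) = 0.8.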
Exercise 11
The singular value decomposition of a term-document matrix A = U ΣV T is

    1 0 −1
1 −1 0 3 0 0
1 1 1 1 1
U = √ 1 1 √0  , Σ = 0 2 0 , V =√  
2 0 0 2 0 0 1 3 0 1 −1

1 −1 0
11.1) What is the rank of the matrix A?
11.2) Perform a reduction of the LSI space by one dimension.
11.3) Given the new representation of matrix Â, apply an agglomerative clustering procedure to the collection.
Merge the clusters at each step according to the self-similarity measure, using as the measure of inter-document
similarity simply the dot product ⟨d1, d2⟩.
11.4) Draw the resulting dendrogram. How many clusters can you find at a level of similarity of 2?
11.5) Check the clustering results you get by cutting across the dendrogram, by plotting them.

Solution —
11.1) Matrix A was originally a 3 × 4 matrix. The three elements in the diagonal matrix Σ are non-null, therefore matrix
AT A (hence, matrix A) has full rank (3).
11.2) Let us obtain Û , V̂ and Σ̂ by removing the third column from U and V , and the third row and column from Σ,
corresponding to the smallest eigenvalue of AT A:
                   [ 1 −1 ]           [ 3 0 ]                  [ 1  0 ]
    Û = (1/√2) ·   [ 1  1 ] ,   Σ̂ =  [ 0 2 ] ,  V̂ = (1/√3) ·  [ 1  1 ]
                   [ 0  0 ]                                    [ 0  1 ]
                                                               [ 1 −1 ]

11.3) After rank reduction, we can compute the similarity by


                               [ 1  0 ]
                               [ 1  1 ]   [ 9 0 ]   [ 1 1 0  1 ]
    Âᵀ Â = V̂ Σ̂² V̂ᵀ = (1/3) ·  [ 0  1 ] · [ 0 4 ] · [ 0 1 1 −1 ]
                               [ 1 −1 ]

Therefore, we obtain the following table of unnormalized dot product similarities:

          2      3      4
    1     3      0      3
    2           4/3    5/3
    3                 −4/3

11.4) {1, 2} and {1, 4} are both candidates as the first cluster. Let us choose the first pair. Therefore, at level 3 the first clustering step yields

               3       4
    {1, 2}   13/9    23/9
    3               −4/3

Now, the highest self-similarity value is achieved by cluster {1, 2, 4} at level 23/9, so that the similarity matrix becomes

                   3
    {1, 2, 4}   23/18

Therefore, at similarity level 2 we have two clusters: {1, 2, 4} and {3}.


11.5) The corresponding dendrogram is

[Dendrogram: documents 1 and 2 merge at level 3, document 4 joins at level 23/9, and document 3 joins at level 23/18.]
Exercise 12
Consider the query: “love Mary” and the set of returned documents of an information retrieval system:
d1: John gives a book to Mary.
d2: John who reads a book loves Mary.
d3: Whom does John think Mary loves?
d4: John thinks a book is a good gift.
12.1) Define the set of keywords and give documents the corresponding representation, after applying the stop
word elimination and the stemming processes. Assume that we are using direct term frequency (with no scaling
and no document frequency). Do not normalize vectors.
12.2) Suppose that document 2 has been judged as relevant, and document 4 as not-relevant. Using Rocchio
relevance feedback what would the revised query vector be after relevance feedback?
Assume α = 1, β = 0.5, γ = 0.5.
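The solution is left to the reader, but the Rocchio update itself, q′ = αq + β·centroid(relevant) − γ·centroid(non-relevant), can be sketched as follows (Python; the vectors in the usage example are hypothetical, not the document vectors of this exercise):

```python
import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.5):
    """Rocchio relevance feedback: move the query vector toward the centroid
    of the relevant documents and away from that of the non-relevant ones."""
    q_new = alpha * np.asarray(q, dtype=float)
    if relevant:
        q_new = q_new + beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q_new = q_new - gamma * np.mean(nonrelevant, axis=0)
    return q_new
```

For instance, with α = 1 and β = γ = 0.5, a hypothetical query (1, 0) with one relevant document (2, 2) and one non-relevant document (0, 2) is revised to (2, 0).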

Exercise 13
A large set of documents, each containing a large number of terms, is given. The aim of this exercise is to create
an index that maps each term to the document where it occurs in the earliest position (ties may be broken at
will). For example, given the three following documents:
Filename Content
random.doc Zigzag bumblebee slash acorn
nonsense.txt Bumblebee acorn zigzag slash
useless.pdf Acorn dot bumblebee slash zigzag
the index should be:

    acorn     ↦ useless.pdf
    bumblebee ↦ nonsense.txt
    dot       ↦ useless.pdf
    slash     ↦ random.doc
    zigzag    ↦ random.doc

In fact, the word “bumblebee” appears in position 2 of file random.doc, in position 1 of file nonsense.txt
and in position 3 of file useless.pdf, therefore it is mapped to nonsense.txt.
13.1) Outline a MapReduce-based solution to the problem. In particular, describe the input and output records
of the mapper and reducer functions.
13.2) Could a combiner be useful? Provide a short motivation to your answer.
13.3) Implement the mapper and the reducer; assume that the data are already tokenized and use any language
or high-level pseudo-code.
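A possible sketch for 13.1 and 13.3 (Python): the mapper emits (term, (position, filename)) records, and the reducer keeps, for each term, the record with the smallest position. Since taking the minimum is associative, the same function can serve as a combiner (13.2), reducing the data shuffled between the two phases.

```python
from itertools import groupby

def map_fn(filename, tokens):
    # Emit (term, (position, filename)) for every token; positions start at 1
    return [(t.lower(), (pos, filename)) for pos, t in enumerate(tokens, 1)]

def reduce_fn(term, values):
    # Keep the (position, filename) record with the earliest position;
    # min() breaks ties on the filename, which the exercise allows
    return (term, min(values)[1])

def build_index(docs):
    # In-memory stand-in for the MapReduce runtime: map, shuffle, reduce
    pairs = sorted(kv for name, toks in docs.items() for kv in map_fn(name, toks))
    return dict(reduce_fn(term, [v for _, v in group])
                for term, group in groupby(pairs, key=lambda kv: kv[0]))

docs = {
    "random.doc":   "Zigzag bumblebee slash acorn".split(),
    "nonsense.txt": "Bumblebee acorn zigzag slash".split(),
    "useless.pdf":  "Acorn dot bumblebee slash zigzag".split(),
}
index = build_index(docs)
```

On the three example files this reproduces the index shown above (e.g. `index["bumblebee"]` is `"nonsense.txt"`).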
Exercise 14
A directed graph is a pair G = (V, E), where V , the vertex set is a finite set of terms, and E ⊆ V × V is the
edge set. The indegree of a vertex v ∈ V is the number of incoming edges (the cardinality of E ∩ (V × {v})),
while its outdegree is the number of outgoing edges (the cardinality of E ∩ ({v} × V )).
An edge in E can be represented by a line of text containing two terms (the first is the origin, the second the
destination of the edge). The edge set E is therefore represented by a collection of lines of text.
Given a collection of lines describing the edge set E (either coming from a single file or split among several
files), we want to design a MapReduce system to produce a list of vertices, each associated with a pair of integers
representing their indegree and outdegree.
For example, consider the set of lines on the left. The resulting mapping is shown on the right.

    lorem ipsum               lorem ↦ (0, 2)
    dolor sit                 ipsum ↦ (2, 1)
    amet ipsum        ⇒       dolor ↦ (1, 2)
    lorem dolor               sit   ↦ (2, 0)
    ipsum sit                 amet  ↦ (1, 1)
    dolor amet

The corresponding graph is the following:

[Figure: the directed graph on nodes lorem, ipsum, dolor, sit and amet, with the six edges listed above.]

14.1) What are the domain and codomain of the Map and Reduce functions? Is it possible to use the Reduce
function as a combiner as well?
14.2) Write a pseudo-code implementation of the relevant functions.

Exercise 15
Below is a table showing how a human judge rated the relevance of a set of 15 documents with respect to a
particular information need (0 = nonrelevant, 1 = relevant).

docID relevance docID relevance docID relevance


1 0 6 1 11 1
2 1 7 0 12 1
3 1 8 0 13 1
4 1 9 0 14 1
5 0 10 1 15 1

Let us assume that two different information retrieval engines (S1 and S2) compute for this query the following
rankings:
S1: (5, 8, 9, 1, 3, 4, 2, 10, 12, 13, 15, 6, 7, 14, 11) and S2: (7, 8, 1, 10, 12, 2, 3, 5, 13, 15, 14, 4, 6, 9, 11).
15.1) Does intuition suggest that one of the two IR systems is better than the other?
Show that your guess is right by analysing the performance of the two systems and by comparing them.
Use the most suitable performance measures and methods among those we have seen during the course, giving
as much evidence as you can.
15.2) Suppose the IR engine can return only the first 8 documents to the user. Compare the performance of the
systems in this case.
Which IR system is the best? Justify your answer.
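Among the measures seen in the course, (uninterpolated) average precision summarizes a whole ranking in one number; a sketch that computes it for both systems (Python; relevance flags taken from the table):

```python
def average_precision(ranking, relevant):
    # Mean of the precision values at the ranks where relevant docs appear
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant)

relevant = {2, 3, 4, 6, 10, 11, 12, 13, 14, 15}   # relevance = 1 in the table
s1 = [5, 8, 9, 1, 3, 4, 2, 10, 12, 13, 15, 6, 7, 14, 11]
s2 = [7, 8, 1, 10, 12, 2, 3, 5, 13, 15, 14, 4, 6, 9, 11]
```

Comparing `average_precision(s1, relevant)` with `average_precision(s2, relevant)`, and the precision of the two 8-document prefixes for 15.2, provides the evidence requested above.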
Exercise 16
Consider the following set of documents, where the vocabulary is composed of three words and we have two
categories A and B:
    (5, 6, 0) ↦ A
    (2, 1, 3) ↦ A
    (7, 7, 0) ↦ A
    (2, 2, 5) ↦ A
    (0, 8, 4) ↦ B
    (2, 0, 8) ↦ B
    (7, 1, 3) ↦ B
16.1) Perform one iteration of the k-means algorithm assuming that the initial clustering corresponds to the
provided categorization. Show the final clustering.
16.2) Suppose that the document set is used as a training set for a supervised k-Nearest-Neighbors classifier.
Given the new document (4, 4, 1), how would the classifier categorize it for k = 1? What for k = 3?
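Both questions can be checked with a short sketch (Python; Euclidean distance is assumed both for k-means and for the k-NN classifier):

```python
import math
from collections import Counter

docs = [((5, 6, 0), 'A'), ((2, 1, 3), 'A'), ((7, 7, 0), 'A'), ((2, 2, 5), 'A'),
        ((0, 8, 4), 'B'), ((2, 0, 8), 'B'), ((7, 1, 3), 'B')]

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans_step(docs):
    # One k-means iteration: compute the two centroids, then reassign points
    cents = {lab: centroid([p for p, l in docs if l == lab])
             for lab in ('A', 'B')}
    return [(p, min(cents, key=lambda lab: math.dist(p, cents[lab])))
            for p, _ in docs]

def knn(docs, x, k):
    # Majority label among the k nearest training points
    nearest = sorted(docs, key=lambda pl: math.dist(pl[0], x))[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]
```

`kmeans_step(docs)` gives the clustering of 16.1, and `knn(docs, (4, 4, 1), k)` the classifications of 16.2.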

Exercise 17
Below is a table showing how two human judges rated the relevance of a set of 12 documents to a particular
information need (0 = nonrelevant, 1 = relevant).
Let us assume that you have written an information retrieval engine that for this query returns the set of docu-
ments {4, 5, 6, 7, 8}.

docID Judge1 Judge2


1 0 0
2 0 0
3 1 1
4 1 1
5 1 0
6 1 0
7 1 0
8 1 0
9 0 1
10 0 1
11 0 1
12 0 1

17.1) Calculate the precision and recall of your system if a document is considered relevant only if both judges
agree.
17.2) Calculate the precision and recall of your system if a document is considered relevant if either judge thinks
it is relevant.
17.3) Suppose the documents are returned by your IR engine in the ID order as in the table.
a. Plot the Precision versus Recall graph for the first case (a document is considered relevant only if both
judges agree) when varying the number of documents returned (1 document returned, 2 documents re-
turned, etc).
b. Determine the interpolated precision at level ρ = 0.5 of recall.
c. How many documents should the system return in order to maximize its performance? Justify your
answer.
Exercise 18
The network of references for a set of five hypertexts is given in figure:

[Figure: the reference network over the five hypertexts 1–5.]

Compute the first 5 iterations of the PageRank and HITS algorithms in the following hypotheses:
• No damping factor.
• Initial PageRank vector gives probability 1 to node 1.

• Initial hub and authority vectors are uniformly 1 over all nodes.
• No normalization required.

Exercise 19
The network of references for a set of four hypertexts is given in figure:

[Figure: the reference network over the four hypertexts 1–4.]

19.1) Execute the first four steps of the PageRank algorithm starting from user being with certainty at node 1
(no damping factor).
19.2) Compute the stationary PageRank scores of the documents.

Exercise 20
Suppose that a query, executed on the same network as Exercise 19, returns nodes 1 and 2, as the root set and
that we want to use the HITS algorithm in order to rank the pages.
20.1) Define the expanded set and the base set for the given query.
20.2) Compute the first five iterations of the HITS algorithm for the base set.
20.3) Which hub and authority values will asymptotically dominate?

Exercise 21
Let a hypertext system be a complete bipartite graph with 3 hubs and 2 authorities.
21.1) For every node in the system, draw a link from the node to itself. Write the adjacency matrix of the system,
and normalize it for use with the PageRank algorithm.
21.2) What is the PageRank score of the nodes in the system? Provide both an analytical proof and an intuitive
explanation. Assume no damping factor.
21.3) Now add a link from one of the authorities to one of the hubs. What is the PageRank score of the nodes
now? Provide both an analytical proof and an intuitive explanation.

Solution — 21.1) The requested graph is the complete bipartite graph from the three hubs to the two authorities, with a self-loop added at every node.

[Figure: hubs 1, 2, 3 and authorities 4, 5; every hub links to both authorities, and every node links to itself.]

The adjacency matrix, assuming that hubs are the first three nodes, and its normalized version are

        [ 1 0 0 1 1 ]          [ 1/3  0   0  1/3 1/3 ]
        [ 0 1 0 1 1 ]          [  0  1/3  0  1/3 1/3 ]
    E = [ 0 0 1 1 1 ]      L = [  0   0  1/3 1/3 1/3 ]
        [ 0 0 0 1 0 ]          [  0   0   0   1   0  ]
        [ 0 0 0 0 1 ]          [  0   0   0   0   1  ]

21.2) We must find the principal eigenvector, corresponding to eigenvalue 1, of the matrix Lᵀ:

    Lᵀ v = v

Let us make the system explicit:

    v1 = (1/3) v1
    v2 = (1/3) v2
    v3 = (1/3) v3
    v4 = (1/3)(v1 + v2 + v3) + v4
    v5 = (1/3)(v1 + v2 + v3) + v5
Therefore, principal eigenvectors are of the form (0, 0, 0, v4 , v5 ). The vector must be normalised, so that v4 + v5 = 1, and
by symmetry considerations we get the final score: (0, 0, 0, 1/2, 1/2).
By intuition, after one step (at most) the user will be trapped in one of the authorities, and will never go back; thus, the
PageRank score of the hubs is 0 (after a transient period the user will never visit them). By symmetry, the probability of the
user being in any authority is equal.
21.3) The graph becomes the previous one with an additional edge from authority 4 back to hub 1:

[Figure: the graph of 21.1 plus the edge 4 → 1.]

corresponding to the following adjacency matrix (on the right, the normalized version):

         [ 1 0 0 1 1 ]           [ 1/3  0   0  1/3 1/3 ]
         [ 0 1 0 1 1 ]           [  0  1/3  0  1/3 1/3 ]
    E′ = [ 0 0 1 1 1 ]      L′ = [  0   0  1/3 1/3 1/3 ]
         [ 1 0 0 1 0 ]           [ 1/2  0   0  1/2  0  ]
         [ 0 0 0 0 1 ]           [  0   0   0   0   1  ]

The eigenvector equation L′ᵀ v = v becomes:

    v1 = (1/3) v1 + (1/2) v4
    v2 = (1/3) v2
    v3 = (1/3) v3
    v4 = (1/3)(v1 + v2 + v3) + (1/2) v4
    v5 = (1/3)(v1 + v2 + v3) + v5
Therefore, principal eigenvectors are of the form (0, 0, 0, 0, v5 ), and by normalization we get the final score: (0, 0, 0, 0, 1).
By intuition, sooner or later the user will be trapped in the pure authority, and will never go out.
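Both stationary vectors can be verified numerically by power iteration on the normalized matrices above (a numpy sketch; no damping, as in the exercise):

```python
import numpy as np

# Normalized link matrix of 21.1 (hubs are the first three nodes)
L = np.array([[1/3, 0, 0, 1/3, 1/3],
              [0, 1/3, 0, 1/3, 1/3],
              [0, 0, 1/3, 1/3, 1/3],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]])

# 21.3: authority 4 now also links back to hub 1
L2 = L.copy()
L2[3] = [1/2, 0, 0, 1/2, 0]

def pagerank(L, steps=300):
    v = np.full(L.shape[0], 1 / L.shape[0])  # start from the uniform vector
    for _ in range(steps):
        v = v @ L                            # v <- L^T v
    return v
```

Here `pagerank(L)` converges to (0, 0, 0, 1/2, 1/2) and `pagerank(L2)` to (0, 0, 0, 0, 1), matching the analytical results.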

Exercise 22
Let V1 and V2 be two finite sets. Then the set of edges in a complete directed bipartite graph having V1 as source
nodes and V2 as destination nodes is the Cartesian product of the two sets:

V1 × V2 = {(i, j) : i ∈ V1 ∧ j ∈ V2 }

Let us define graph G = (V, E) where:

V = {1, . . . , 12}
     
E = {1, 2, 3} × {4, 5, 6} ∪ {5, 6} × {7, 8} ∪ {9, 10} × {11, 12} .

The three subsets of E identify three bipartite components of G:


 
G1 = {1, . . . , 6}, {1, 2, 3} × {4, 5, 6}
 
G2 = {5, . . . , 8}, {5, 6} × {7, 8}
 
G3 = {9, . . . , 12}, {9, 10} × {11, 12}

Note that the three components are not disjoint, but the graph is not connected.
For every node n, according to the HITS scoring system, let h(n) be its hub score and let a(n) be its authority score. Moreover, if B = (VB, EB) is a bipartite graph, its importance I(B) is defined as the sum of the hub scores of its source nodes plus the sum of the authority scores of its destination nodes:

    I(B) = Σ_{i : ∃j (i,j) ∈ EB} h(i) + Σ_{i : ∃j (j,i) ∈ EB} a(i).

22.1) Which bipartite component (among G1 , G2 and G3 ) will asymptotically achieve the maximum impor-
tance, and why?
22.2) Simulate three iterations of the HITS system starting with a uniform value of 1 to all hub and authority
scores. What is the importance of each bipartite component, at the end?
22.3) If the edge (3, 9) is added to G, how do you expect the importance scores of the three components to
change, and why?
22.4) With the further addition of edge (10, 3) to the graph, how do you expect the importance scores of the
three components to change, and why?

Solution — The initial graph is the following (the three bipartite components are also shown):

[Figure: component G1 with sources 1, 2, 3 and destinations 4, 5, 6; component G2 with sources 5, 6 and destinations 7, 8; component G3 with sources 9, 10 and destinations 11, 12.]

22.1) The HITS ranking system favors the largest bipartite component, which corresponds to the principal eigenvector
of E T E. Therefore, component G1 will asymptotically prevail.
22.2) Authority scores:

Node 1 2 3 4 5 6 7 8 9 10 11 12
Initial value 1 1 1 1 1 1 1 1 1 1 1 1
Step 1 0 0 0 3 3 3 2 2 0 0 2 2
Step 2 0 0 0 9 9 9 4 4 0 0 4 4
Step 3 0 0 0 27 27 27 8 8 0 0 8 8

Hub scores:
Node 1 2 3 4 5 6 7 8 9 10 11 12
Initial value 1 1 1 1 1 1 1 1 1 1 1 1
Step 1 3 3 3 0 2 2 0 0 2 2 0 0
Step 2 9 9 9 0 4 4 0 0 4 4 0 0
Step 3 27 27 27 0 8 8 0 0 8 8 0 0
Importance scores:
Component G1 G2 G3
Initial value 6 4 4
Step 1 18 8 8
Step 2 54 16 16
Step 3 162 32 32
22.3) After the new edge, the graph is the following:

[Figure: the initial graph with the added edge 3 → 9.]

Due to the current hub value of node 3, the authority value of node 9 increases, and in turn also the hub value of node 3 will
increase, and therefore the authority values of nodes 4, 5, and 6. Therefore, the new edge causes I(G1 ) to increase. On the
other hand, the authority value of node 9 does not impact on I(G3 ), where it is a source, and for the same reason the new
edge has no impact on I(G2 ).
22.4) Finally, after the addition of the last edge:

[Figure: the graph with both added edges 3 → 9 and 10 → 3.]

The edge affects the authority score of node 3, therefore I(G1) and I(G2) do not change, while the hub score of node 10 (and hence the authority scores of nodes 11 and 12) increases. Therefore, the new edge only affects I(G3).

Exercise 23
A set of four web pages (A, B, C and D) is completely connected: all pages contain links to every other page,
while no page contains links to itself.
23.1) Compute the PageRank score of all pages.
23.2) Now add web page E, and two links: one from C to E, the second from E to D (so that E has exactly an
incoming link and an outgoing link). Compute the PageRank score of all pages.

Exercise 24
Consider a document corpus with m = 6 documents, n = 5 terms. Suppose that documents have been clustered
into m′ = 2 clusters and terms have been clustered into n′ = 2 clusters. The following document-term matrix
and cluster attribution has been determined:
1 2 3 4 5
1 1 2 1 2
1 1 X
2 2 X X
3 1 X X X
4 1 X X
5 2 X X
6 2 X X
24.1) Consider the Jaccard index as similarity measure. Suppose that all we know about a document is that it
contains term 2. Which other term is most likely to occur in the same document?
24.2) Compute the following probabilities for all suitable index values:
• the probability pi′ that a random document belongs to cluster i′ ;
• the probability pj ′ that a random item belongs to cluster j ′ ;
• the probability pi′ j ′ that a document in cluster i′ contains a term in cluster j ′ .
24.3) Perform a step of the Gibbs Sampling technique on document 4 by computing the posterior probabilities
π4→i′ for i′ = 1, 2. Was the proposed cluster attribution likely, or will it be probably changed?
Exercise 25
Given the following three documents (each row is a document and each cell corresponds to a term and contains
its term id)

1 1 2 1 5 2 2
2 4 3 3 1 2 1
3 2 2 5 4 3 3

assume a multinomial model for the document generation and estimate the parameters of the term distribution by
using the maximum likelihood estimation method. (Show all the steps to obtain the best parameter estimation)
As all the documents have the same length, assume P (L = ld |Θ) = 1 in the multinomial model
P (ld , n(d, t)|Θ).

Solution — The multinomial model for a document generation is the following:


    P(d|Θ) = P(ld, n(d, t)|Θ) = P(L = ld|Θ) · ( ld choose {n(d, t)} ) · Π_{t∈d} θt^n(d,t)        (1)

where Θ = (θt ∀t ∈ T) and T is our vocabulary.
We have a set D = (d1, d2, d3) of iid observations, thus P(D|Θ) = Π_{d∈D} P(d|Θ).
In this case, as we assume P(L = ld|Θ) = 1, then:

    P(D|Θ) = Π_d ( ld choose {n(d, t)} ) · Π_t θt^n(d,t)        (2)

This model corresponds to the likelihood function L(Θ|D).


We want to estimate the parameters Θ which maximize the likelihood function. We can do that by computing the partial derivatives with respect to each of the parameters θt and setting them equal to zero.
From now on we will consider the log-likelihood function, which is easier to differentiate:

    log L(Θ|D) = Σ_d [ log ( ld choose {n(d, t)} ) + Σ_t n(d, t) log(θt) ]        (3)
Moreover, there is one constraint on the maximization, namely Σ_t θt = 1, thus we perform a standard Lagrangian optimization:

    ∂/∂θt [ Σ_d ( log ( ld choose {n(d, t)} ) + Σ_t n(d, t) log(θt) ) − λ ( Σ_t θt − 1 ) ] = 0        (4)

    ∂ log L / ∂θt = ( Σ_d n(d, t) ) / θt − λ = 0        (5)
then the estimation of our parameters is:

    θt = ( Σ_d n(d, t) ) / λ        (6)

In order to compute the Lagrange multiplier λ, we consider the constraint Σ_t θt = 1, which becomes Σ_t ( Σ_d n(d, t) ) / λ = 1 and thus λ = Σ_d Σ_t n(d, t) = Σ_d ld = n · ld, where n = |D|.
The parameters are:

    θt = ( Σ_d n(d, t) ) / ( n · ld )        (7)
Substituting the values (the leading 1, 2, 3 in each row of the data is the document number, so n = 3 and ld = 6) we obtain θ1 = 4/18 = 2/9, θ2 = 6/18 = 1/3, θ3 = 4/18 = 2/9, θ4 = 2/18 = 1/9, θ5 = 2/18 = 1/9, which indeed sum to 1.
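As a sanity check the estimates must sum to 1; a short sketch recomputing the counts from the data (Python; the leading document number of each row is stripped):

```python
from collections import Counter
from fractions import Fraction

# The three documents, without the leading document number of each row
docs = [[1, 2, 1, 5, 2, 2],
        [4, 3, 3, 1, 2, 1],
        [2, 2, 5, 4, 3, 3]]

counts = Counter(t for d in docs for t in d)         # sum_d n(d, t)
lam = sum(len(d) for d in docs)                      # lambda = n * l_d = 18
theta = {t: Fraction(counts[t], lam) for t in sorted(counts)}
```

This yields θ1 = 2/9, θ2 = 1/3, θ3 = 2/9, θ4 = 1/9, θ5 = 1/9, which sum to 1.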

Exercise 26
Solve the previous exercise by using the least squares method. (Show all the steps to obtain the best parameter
estimation)
Exercise 27
Given the following three documents (each row is a document and each cell corresponds to a term and contains
its term id)
1 1 2 1 5 2 2 3 2
2 1 3 1 5 2 2
3 2 2 5 4 3 3 2
assume a multinomial model for the document generation and estimate the parameters of the term distribution
by using the least square method. (Show all the steps to obtain the best parameter estimation)

Exercise 28
A document set has been partitioned into two clusters. For each cluster, 100 2-shingles have been sampled
randomly. Shingles were divided into four categories: “2, 4” (term 2 followed by term 4), “2, 4̄” (term 2 followed
by any term different from 4), “2̄, 4” (any term different from 2 followed by term 4) and “2̄, 4̄”.
The following tables show the results of our sampling in clusters 1 and 2 respectively:
C1 4 4̄ C2 4 4̄
2 20 10 2 10 10
2̄ 10 60 2̄ 0 80
28.1) Based on the above contingency tables, what are the relative frequencies of terms 2 and 4 within the two
clusters? What are the relative frequencies of terms different from 2 and 4?
28.2) Consider the following term-based generative model for documents:
1. Choose the cluster by an unbiased coin throw.
2. Document length is always 6.
3. Choose every term of the document independently with probability equal to the frequency of the term
within the chosen cluster.
(Hint: the model depends on four free parameters, i.e., probability of term 2 and probability of term 4 within
each cluster).
Use this model to determine the maximum-likelihood clustering of the three following documents:
• d1 = 1, 2, 4, 2, 3, 5
• d2 = 3, 2, 1, 3, 5, 4
• d3 = 1, 2, 4, 2, 4, 2
28.3) Use a similar generative model based on shingles instead of terms, considering every shingle as indepen-
dent of the others (so that every document is made of 5 independent shingles).
Use this model to determine the maximum-likelihood clustering of the same documents.
Nota bene: this is not an exercise about parameter estimation. Parameters are already given, only document
attribution to clusters must be decided.

Solution — 28.1) We just count the frequency of shingles containing term 2 in the two clusters and divide by the
total number of samples. In cluster 1, for example, 30 samples out of 100 contain term 2. We do similarly for term 4, then
compute the remaining probability:

Term 2 Term 4 Any other term


C1 .3 .3 .4
C2 .2 .1 .7

28.2) We must compute the probability for each document to be generated within each cluster. Since every term is
generated independently, the probability of the document is just the probability of each term being selected in its position.
Let us define as pij the probability for document i to be generated in cluster j. For example:

p21 = .4 × .3 × .4 × .4 × .4 × .3 = .002304

Probabilities are:
C1 C2
d1 .001728 .001372
d2 .002304 .004802
d3 .000972 .000056
Therefore, documents d1 and d3 are attributed to cluster C1 , document d2 to cluster C2 .
28.3) Same computation with shingles:
C1 C2
d1 .00432 .00512
d2 .00216 0
d3 .00864 .00512
Notice that the probability that d2 is generated in cluster C2 is null because it contains the shingle “2̄, 4”. Therefore,
documents d2 and d3 are attributed to cluster C1 , while d1 is attributed to cluster C2 .
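The term-based computation of 28.2 can be sketched as follows (Python; the shingle-based model of 28.3 is analogous, with shingle-category probabilities in place of term probabilities):

```python
# Per-cluster probabilities of term 2, term 4 and any other term (from 28.1)
P_TERM = {'C1': {2: 0.3, 4: 0.3, 'other': 0.4},
          'C2': {2: 0.2, 4: 0.1, 'other': 0.7}}

def p_doc(doc, cluster):
    # Probability of generating the document term by term in the cluster
    p = 1.0
    for t in doc:
        p *= P_TERM[cluster][t if t in (2, 4) else 'other']
    return p

def ml_cluster(doc):
    # Clusters are chosen by an unbiased coin, so compare likelihoods directly
    return max(('C1', 'C2'), key=lambda c: p_doc(doc, c))

docs = {'d1': [1, 2, 4, 2, 3, 5],
        'd2': [3, 2, 1, 3, 5, 4],
        'd3': [1, 2, 4, 2, 4, 2]}
```

For d1, d2 and d3 this reproduces the probabilities and attributions found above.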

Exercise 29
As a part of a clustering method, we decide to compute for each document d the number of occurrences of the
most frequent term in that document:
N (d) = max n(t, d).
t∈T

We want to model N (d) as a Poisson random variable with parameter λ, so that given a random document d in
our corpus we have
λn e−λ
Pr(N (d) = n; λ) =
n!
29.1) Given a sampled document set d1 , . . . , dk , show that the maximum likelihood estimate of λ in the Poisson
model is the average of N (di ).
29.2) Determine λ based on the following sample:
• d1 = (1, 2, 3, 4, 3, 2, 3, 3, 2, 1, 5, 6, 3)
• d2 = (3, 2, 4, 4, 2, 3, 2, 4, 5, 1, 6)
• d3 = (6, 4, 3, 5, 2, 6, 1, 7)
• d4 = (6, 5, 4, 3, 2, 1, 2, 6, 2, 2)
• d5 = (4, 3, 4, 2, 5, 4, 1, 6, 3)

Solution —
29.1) The likelihood of λ with respect to the sample set is
    L(λ; N(d1), ..., N(dk)) = Pr(N(d1), ..., N(dk); λ) = Π_{i=1}^{k} Pr(N(di); λ) = Π_{i=1}^{k} λ^N(di) e^{−λ} / N(di)!

Therefore, the log-likelihood is

    log L(λ; N(d1), ..., N(dk)) = Σ_{i=1}^{k} ( N(di) log λ − λ − log N(di)! ),

and its derivative with respect to λ is

    d/dλ log L(λ; N(d1), ..., N(dk)) = Σ_{i=1}^{k} ( N(di)/λ − 1 ) = (1/λ) Σ_{i=1}^{k} N(di) − k.

Equating the derivative to zero, we get

    λ = (1/k) Σ_{i=1}^{k} N(di).
29.2) By counting:

    N(d1) = 5,  N(d2) = 3,  N(d3) = 2,  N(d4) = 4,  N(d5) = 3;

the maximum likelihood estimate is the average of these values over the 5-document sample:

    λ̂ = (5 + 3 + 2 + 4 + 3) / 5 = 17/5.
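Both steps can be verified with a short sketch (Python):

```python
from collections import Counter

docs = [[1, 2, 3, 4, 3, 2, 3, 3, 2, 1, 5, 6, 3],
        [3, 2, 4, 4, 2, 3, 2, 4, 5, 1, 6],
        [6, 4, 3, 5, 2, 6, 1, 7],
        [6, 5, 4, 3, 2, 1, 2, 6, 2, 2],
        [4, 3, 4, 2, 5, 4, 1, 6, 3]]

# N(d): number of occurrences of the most frequent term in d
N = [Counter(d).most_common(1)[0][1] for d in docs]

# Maximum-likelihood estimate of the Poisson parameter: the sample mean
lam = sum(N) / len(N)
```

This gives N = (5, 3, 2, 4, 3) and λ̂ = 17/5 = 3.4.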
