You are on page 1of 27

Homework


•  Define
a
loss
func2on
that
compares
two
matrices
(say

mean
square
error)

•  b
=
svd(bellcore)

•  b2
=
b$u[,1:2]
%*%
diag(b$d[1:2])
%*%
t(b$v[,1:2])

•  b3
=
b$u[,1:3]
%*%
diag(b$d[1:3])
%*%
t(b$v[,1:3])

•  More
generally,
for
all
possible
r

–  Let
b.r
=
b$u[,1:r]
%*%
diag(b$d[1:r])
%*%
t(b$v[,1:r])

•  Compute
the
loss
between
bellcore
and
b.r
as
a

func2on
of
r

•  Plot
the
loss
as
a
func2on
of
r

IR
Models

•  Keywords
(and
Boolean
combina2ons
thereof)

•  Vector‐Space
‘‘Model’’
(Salton,
chap
10.1)

–  Represent
the
query
and
the
documents
as
V‐

dimensional
vectors
 ∑ xi ⋅ yi
–  Sort
vectors
by
 sim(x, y) = cos(x, y) = i
| x |⋅ | y |
•  Probabilis2c
Retrieval
Model

–  (Salton,
chap
10.3)

Pr(w | rel)
–  Sort
documents
by

€ score(d) = ∏
w ∈d Pr(w | rel)
Alterna2ve
IR
models


Instructor:
Rada
Mihalcea


Some
of
the
slides
were
adopted
from
a
course
tought
at
Cornell
University
by
William
Y.
Arms


Latent
Seman2c
Indexing

Objective
Replace indexes that use sets of index terms by indexes that use concepts.
Approach
Map the term vector space into a lower dimensional space, using singular
value decomposition.
Each dimension in the new space corresponds to a latent concept in the
original data.
Deficiencies
with
Conven2onal

Automa2c
Indexing

Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision)
Independence: No significance is given to two terms that frequently appear together
Latent semantic indexing addresses the first of these (synonymy),
and the third (dependence)
Bellcore’s
Example

h_p://en.wikipedia.org/wiki/Latent_seman2c_analysis


c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
Term
by
Document
Matrix

"bellcore"<‐

structure(.Data
=
c(1,
1,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
1,
1,
1,
1,

0,
1,
0,
0,
0,
0,
1,
0,
1,
1,
0,
0,
1,
0,
0,
0,
0,
1,
0,
0,
0,
2,
0,

0,
1,
0,
0,
0,
0,
0,
0,
0,
1,
0,
1,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,

0,
0,
0,
0,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
1,
0,
0,
0,
0,
0,

0,
0,
0,
0,
0,
1,
1,
1,
0,
0,
0,
0,
0,
0,
0,
0,
1,
0,
1,
1),
.Dim
=
c(

12,
9),
.Dimnames
=
list(c("human",
"interface",
"computer",
"user",

"system",
"response",
"@me",
"EPS",
"survey",
"trees",
"graph",

"minors"),
c("c1",
"c2",
"c3",
"c4",
"c5",
"m1",
"m2",
"m3",
"m4")))


help(dump)

help(source)

Query
Expansion

Query:
Find documents relevant to human computer interaction
Simple Term Matching:
Matches c1, c2, and c4
Misses c3 and c5
Large

Correl‐
a2ons

Correla2ons:
Too
Large
to
Ignore

How
to
compute
correla2ons

round(100 * cor(bellcore))
c1 c2 c3 c4 c5 m1 m2 m3 m4
c1 100 -19 0 0 -33 -17 -26 -33 -33
c2 -19 100 0 0 58 -30 -45 -58 -19
c3 0 0 100 47 0 -21 -32 -41 -41
c4 0 0 47 100 -31 -16 -24 -31 -31
c5 -33 58 0 -31 100 -17 -26 -33 -33
m1 -17 -30 -21 -16 -17 100 67 52 -17
m2 -26 -45 -32 -24 -26 67 100 77 26
m3 -33 -58 -41 -31 -33 52 77 100 56
m4 -33 -19 -41 -31 -33 -17 26 56 100

round(100 * cor(t(bellcore)))
human interface computer user system response time EPS survey trees graph minors
human 100 36 36 -38 43 -29 -29 36 -29 -38 -38 -29
interface 36 100 36 19 4 -29 -29 36 -29 -38 -38 -29
computer 36 36 100 19 4 36 36 -29 36 -38 -38 -29
user -38 19 19 100 23 76 76 19 19 -50 -50 -38
system 43 4 4 23 100 4 4 82 4 -46 -46 -35
response -29 -29 36 76 4 100 100 -29 36 -38 -38 -29
time -29 -29 36 76 4 100 100 -29 36 -38 -38 -29
EPS 36 36 -29 19 82 -29 -29 100 -29 -38 -38 -29
survey -29 -29 36 19 4 36 36 -29 100 -38 19 36
trees -38 -38 -38 -50 -46 -38 -38 -38 -38 100 50 19
graph -38 -38 -38 -50 -46 -38 -38 -38 19 50 100 76
minors -29 -29 -29 -38 -35 -29 -29 -29 36 19 76 100
Height

-1.0 -0.5 0.0 0.5

trees

graph

minors

survey

user

response

time

system
Cluster Dendrogram

EPS

computer

human
plot(hclust(as.dist(‐cor(t(bellcore)))))


interface

plot(hclust(as.dist(‐cor(bellcore))))

Cluster Dendrogram
0.4
0.0

m4
Height

c1
-0.4

c3

c4
-0.8

m1

c2

c5
m2

m3
Correc2ng


for

Large

Correla2ons

Thesaurus

Term
by
Doc

Matrix:

Before
&
Aeer

Thesaurus

Singular
Value
Decomposi2on
(SVD)

X = UDVT

txd txm mxm mxd

D VT
X = U
•  m is the rank of X < min(t, d)
•  D is diagonal
–  D2 are eigenvalues (sorted in descending
order)

•  U UT = I and V VT = I
–  Columns of U are eigenvectors of X XT
–  Columns of V are eigenvectors of XT X
•  m is the rank of X < min(t, d)
•  D is diagonal
–  D2 are eigenvalues (sorted in descending order)

•  U UT = I and V VT = I
–  Columns of U are eigenvectors of X XT
–  Columns of V are eigenvectors of XT X
Dimensionality

Reduc2on

txd txk kxk kxd

D VT

^ =
X U

k is the number of latent concepts


(typically 300 ~ 500)
Dimension
Reduc2on
in
R

b
=
svd(bellcore)

b2
=
b$u[,1:2]
%*%
diag(b$d[1:2])
%*%
t(b$v[,1:2])

dimnames(b2)
=
dimnames(bellcore)

par(mfrow=c(2,2))

plot(hclust(as.dist(‐cor(bellcore))))

plot(hclust(as.dist(‐cor(t(bellcore)))))

plot(hclust(as.dist(‐cor(b2))))

plot(hclust(as.dist(‐cor(t(b2)))))

Cluster Dendrogram Cluster Dendrogram

-1.0
-0.8 0.0
Height

Height

trees

survey

computer
human
interface
m4

graph
minors

user

system
EPS
response
time
c1

c3

c4
m1

c2

c5
m2

m3

as.dist(-cor(bellcore)) as.dist(-cor(t(bellcore)))
hclust (*, "complete") hclust (*, "complete")

Cluster Dendrogram Cluster Dendrogram


0.5

-1.0
Height

Height

survey
system
human
EPS
interface

user

trees
response
time
computer

graph
minors
-1.0

c2

c5
m4

c4
m1

m2

m3

c1

c3

as.dist(-cor(b2)) as.dist(-cor(t(b2)))
hclust (*, "complete") hclust (*, "complete")
SVD

B
BT
=
U
D2
UT

BT
B
=
V
D2
VT


Doc


Term


Latent

Dimension
Reduc2on

Block
Structure

round(100*cor(bellcore))
c1 c2 c3 c4 c5 m1 m2 m3 m4
c1 100 -19 0 0 -33 -17 -26 -33 -33
c2 -19 100 0 0 58 -30 -45 -58 -19
c3 0 0 100 47 0 -21 -32 -41 -41
c4 0 0 47 100 -31 -16 -24 -31 -31
c5 -33 58 0 -31 100 -17 -26 -33 -33
m1 -17 -30 -21 -16 -17 100 67 52 -17
m2 -26 -45 -32 -24 -26 67 100 77 26
m3 -33 -58 -41 -31 -33 52 77 100 56
m4 -33 -19 -41 -31 -33 -17 26 56 100
> round(100*cor(b2))
c1 c2 c3 c4 c5 m1 m2 m3 m4
c1 100 91 100 100 84 -86 -85 -85 -81
c2 91 100 91 88 99 -57 -56 -56 -50
c3 100 91 100 100 84 -86 -85 -85 -81
c4 100 88 100 100 81 -89 -88 -88 -84
c5 84 99 84 81 100 -44 -44 -43 -37
m1 -86 -57 -86 -89 -44 100 100 100 100
m2 -85 -56 -85 -88 -44 100 100 100 100
m3 -85 -56 -85 -88 -43 100 100 100 100
m4 -81 -50 -81 -84 -37 100 100 100 100
Dimension
Reduc2on

Block
Structure

round(100*cor(t(bellcore)))
human interface computer user system response time EPS survey trees graph minors
human 100 36 36 -38 43 -29 -29 36 -29 -38 -38 -29
interface 36 100 36 19 4 -29 -29 36 -29 -38 -38 -29
computer 36 36 100 19 4 36 36 -29 36 -38 -38 -29
user -38 19 19 100 23 76 76 19 19 -50 -50 -38
system 43 4 4 23 100 4 4 82 4 -46 -46 -35
response -29 -29 36 76 4 100 100 -29 36 -38 -38 -29
time -29 -29 36 76 4 100 100 -29 36 -38 -38 -29
EPS 36 36 -29 19 82 -29 -29 100 -29 -38 -38 -29
survey -29 -29 36 19 4 36 36 -29 100 -38 19 36
trees -38 -38 -38 -50 -46 -38 -38 -38 -38 100 50 19
graph -38 -38 -38 -50 -46 -38 -38 -38 19 50 100 76
minors -29 -29 -29 -38 -35 -29 -29 -29 36 19 76 100
> round(100*cor(t(b2)))
human interface computer user system response time EPS survey trees graph minors
human 100 100 93 94 99 82 82 100 -12 -85 -84 -83
interface 100 100 95 96 100 85 85 100 -7 -82 -80 -80
computer 93 95 100 100 96 98 98 93 26 -59 -57 -56
user 94 96 100 100 97 97 97 94 23 -62 -60 -59
system 99 100 96 97 100 88 88 100 -2 -79 -78 -77
response 82 85 98 97 88 100 100 83 46 -40 -38 -37
time 82 85 98 97 88 100 100 83 46 -40 -38 -37
EPS 100 100 93 94 100 83 83 100 -11 -84 -83 -82
survey -12 -7 26 23 -2 46 46 -11 100 63 65 66
trees -85 -82 -59 -62 -79 -40 -40 -84 63 100 100 100
graph -84 -80 -57 -60 -78 -38 -38 -83 65 100 100 100
minors -83 -80 -56 -59 -77 -37 -37 -82 66 100 100 100
The
term
vector
space

t3
The space has as
many dimensions as
there are terms in
the word list. d1 d2

θ t2

t1
Latent concept
vector space

• term
document
query
--- cosine > 0.9

You might also like