
Modern Information Retrieval

Chapter 5 Query Operations

Presenter: 林秉儀
Student ID: 89522022
Introduction

• It is difficult to formulate queries which are well designed for retrieval purposes.
• The initial query formulation can be improved through query expansion and term reweighting.
• Approaches are based on:
– feedback information from the user
– information derived from the set of documents initially retrieved (called the local set of documents)
– global information derived from the document collection
User Relevance Feedback

• The user is presented with a list of the retrieved documents and, after examining them, marks those which are relevant.
• Two basic operations:
– Query expansion: addition of new terms from relevant documents
– Term reweighting: modification of term weights based on the user relevance judgement
User Relevance Feedback

• User relevance feedback is used to:
– expand queries with the vector model
– reweight query terms with the probabilistic model
– reweight query terms with a variant of the probabilistic model
Vector Model

• Define:
– Weight:
Let $k_i$ be a generic index term in the set $K = \{k_1, \ldots, k_t\}$.
A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$.
– Document index term vector:
the document $d_j$ is associated with an index term vector $\vec{d}_j$ represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
Vector Model (cont’d)

• Define (from Chapter 2):
– term weighting: $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
– normalized frequency: $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$
where $freq_{i,j}$ is the raw frequency of $k_i$ in the document $d_j$
– inverse document frequency for $k_i$: $idf_i = \log \frac{N}{n_i}$
– query term weight: $w_{i,q} = \left(0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$
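The Chapter 2 definitions above can be sketched in a few lines of Python; the function name and argument layout are illustrative, not from the text:

```python
import math
from collections import Counter

def tf_idf_weights(doc_terms, N, doc_freq):
    """Compute w_{i,j} = f_{i,j} * log(N / n_i) for one document d_j.

    doc_terms: list of index terms occurring in d_j
    N: total number of documents in the collection
    doc_freq: dict mapping term k_i -> n_i (number of documents containing k_i)
    """
    raw = Counter(doc_terms)          # raw frequencies freq_{i,j}
    max_freq = max(raw.values())      # max_l freq_{l,j}
    return {
        term: (freq / max_freq) * math.log(N / doc_freq[term])
        for term, freq in raw.items()
    }
```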
Vector Model (cont’d)

• Define:
– query vector:
the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
– $D_r$: set of relevant documents identified by the user
– $D_n$: set of non-relevant documents among the retrieved documents
– $C_r$: set of relevant documents among all documents in the collection
– $\alpha, \beta, \gamma$: tuning constants
Query Expansion and Term Reweighting for the Vector Model

• Ideal case:
– $C_r$: the complete set of relevant documents to a given query q
– the best query vector is given by
$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$
• The relevant documents $C_r$ are not known a priori; they are exactly what we are looking for.
Query Expansion and Term Reweighting for the Vector Model (cont’d)

• Three classic ways to calculate the modified query $\vec{q}_m$:
– Standard_Rocchio:
$\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$
– Ide_Regular:
$\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$
– Ide_Dec_Hi:
$\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \max_{non\text{-}relevant}(\vec{d}_j)$
• $D_r$ and $D_n$ are the document sets judged by the user.
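The Standard_Rocchio update can be sketched in plain Python over term-weight vectors; clipping negative weights to zero is a common convention, not stated on the slide:

```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio: q_m = alpha*q + (beta/|Dr|)*sum(Dr) - (gamma/|Dn|)*sum(Dn).

    q: query weight vector (list of floats, one entry per index term)
    relevant, nonrelevant: lists of document weight vectors (Dr and Dn)
    """
    t = len(q)
    qm = [alpha * w for w in q]
    for docs, coef in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue  # skip empty sets to avoid division by zero
        for i in range(t):
            qm[i] += coef * sum(d[i] for d in docs) / len(docs)
    # negative weights are usually clipped to 0 (convention, not on the slide)
    return [max(w, 0.0) for w in qm]
```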
Term Reweighting for the Probabilistic Model

• Similarity in the vector model is the correlation between the vectors $\vec{d}_j$ and $\vec{q}$, quantified as
$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|}$
• The probabilistic model ranks according to the probabilistic ranking principle.
– $P(k_i \mid R)$: the probability of observing the term $k_i$ in the set $R$ of relevant documents
– $P(k_i \mid \bar{R})$: the probability of observing the term $k_i$ in the set $\bar{R}$ of non-relevant documents
Term Reweighting for the Probabilistic Model

• The similarity of a document $d_j$ to a query q can be expressed as
$sim(d_j, q) \sim \sum_i w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$
• For the initial search, the equation above is estimated with the following assumptions:
$P(k_i \mid R) = 0.5 \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}$
where $n_i$ is the number of documents which contain the index term $k_i$. This gives
$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \log \frac{N - n_i}{n_i}$
Term Reweighting for the Probabilistic Model (cont’d)

• For the feedback search:
– $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$ can be approximated as:
$P(k_i \mid R) = \frac{|D_{r,i}|}{|D_r|} \qquad P(k_i \mid \bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}$
where $D_r$ is the set of relevant documents according to the user judgement, and $D_{r,i}$ is the subset of $D_r$ composed of the documents containing the term $k_i$.
– The similarity of $d_j$ to q becomes:
$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \log \left( \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} \middle/ \frac{n_i - |D_{r,i}|}{N - |D_r| - (n_i - |D_{r,i}|)} \right)$
• No query expansion occurs in this procedure; only the query terms are reweighted.
Term Reweighting for the Probabilistic Model (cont’d)

• Adjustment factor
– Because $|D_r|$ and $|D_{r,i}|$ are often small, a 0.5 adjustment factor is added to $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$:
$P(k_i \mid R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1} \qquad P(k_i \mid \bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}$
– An alternative adjustment factor is $n_i/N$:
$P(k_i \mid R) = \frac{|D_{r,i}| + \frac{n_i}{N}}{|D_r| + 1} \qquad P(k_i \mid \bar{R}) = \frac{n_i - |D_{r,i}| + \frac{n_i}{N}}{N - |D_r| + 1}$
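The adjusted estimates combine into a per-term feedback weight (the log-odds-ratio factor inside the similarity sum). A minimal sketch, with an illustrative function name:

```python
import math

def prob_term_weight(n_i, N, dr_i, dr):
    """Feedback term weight using the 0.5 adjustment factor from the slide:
    log of the odds of P(k_i|R) divided by the odds of P(k_i|R̄).

    n_i: documents containing k_i; N: collection size;
    dr_i: |D_{r,i}|; dr: |D_r|.
    """
    p_rel = (dr_i + 0.5) / (dr + 1)
    p_nonrel = (n_i - dr_i + 0.5) / (N - dr + 1)
    return math.log((p_rel / (1 - p_rel)) / (p_nonrel / (1 - p_nonrel)))
```

A term concentrated in the judged-relevant documents gets a positive weight; a term common in the collection but absent from them gets a negative weight.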
A Variant of Probabilistic Term Reweighting

• In 1983, Croft extended the above weighting scheme by suggesting distinct initial search methods and by adapting the probabilistic formula to include within-document frequency weights.
• The variant of probabilistic term reweighting:
$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times F_{i,j,q}$
where $F_{i,j,q}$ is a factor which depends on the triple $[k_i, d_j, q]$.

A Variant of Probabilistic Term Reweighting (cont’d)

• Distinct formulations are used for the initial search and the feedback searches:
– initial search:
$F_{i,j,q} = C + idf_i \times \bar{f}_{i,j} \qquad \bar{f}_{i,j} = K + (1-K) \times \frac{f_{i,j}}{\max_l f_{l,j}}$
where $\bar{f}_{i,j}$ is a normalized within-document frequency; C and K should be adjusted according to the collection.
– feedback searches:
$F_{i,j,q} = \left( C + \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right) \times \bar{f}_{i,j}$
Automatic Local Analysis

• Clustering: the grouping of documents which satisfy a set of common properties.
• Attempting to automatically obtain a description for a larger cluster of relevant documents: identify terms which are related to the query terms, such as:
– Synonyms
– Stemming variations
– Terms with a distance of at most k words from a query term
Automatic Local Analysis (cont’d)

• The local strategy is that the documents retrieved for a given query q are examined at query time to determine terms for query expansion.
• Two basic types of local strategy:
– Local clustering
– Local context analysis
• Local strategies suit the environment of intranets, not web documents.
Query Expansion Through Local Clustering

• Local feedback strategies expand the query with terms correlated to the query terms. Such correlated terms are those present in local clusters built from the local document set.

Query Expansion Through Local Clustering (cont’d)

• Definition:
– Stem:
Let V(s) be a non-empty subset of words which are grammatical variants of each other. A canonical form s of V(s) is called a stem.
Example: if V(s) = {polish, polishing, polished} then s = polish
– $D_l$: the local document set, the set of documents retrieved for a given query q
• Strategies for building local clusters:
– Association clusters
– Metric clusters
– Scalar clusters
Association Clusters

• An association cluster is based on the co-occurrence of stems inside the documents.
• Definition:
– $f_{s_i,j}$: the frequency of a stem $s_i$ in a document $d_j$, $d_j \in D_l$
– Let $\vec{m} = (m_{ij})$ be an association matrix with $|S_l|$ rows and $|D_l|$ columns, where $m_{ij} = f_{s_i,j}$.
– The matrix $\vec{s} = \vec{m}\vec{m}^t$ is a local stem-stem association matrix.
– Each element $s_{u,v}$ in $\vec{s}$ expresses a correlation $c_{u,v}$ between the stems $s_u$ and $s_v$:
$c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}$
Association Clusters (cont’d)

• The correlation factor $c_{u,v}$ quantifies the absolute frequencies of co-occurrence.
– The association matrix $\vec{s}$ is unnormalized if
$s_{u,v} = c_{u,v}$
– and normalized if
$s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$
Association Clusters (cont’d)

• To build local association clusters:
– Consider the u-th row in the association matrix $\vec{s}$
– Let $S_u(n)$ be a function which takes the u-th row and returns the set of n largest values $s_{u,v}$, where v varies over the set of local stems and $v \neq u$
– Then $S_u(n)$ defines a local association cluster around the stem $s_u$.
Metric Clusters

• Two terms which occur in the same sentence seem more correlated than two terms which occur far apart in a document.
• It might be worthwhile to factor in the distance between two terms in the computation of their correlation factor.
Metric Clusters (cont’d)

• Let $r(k_i, k_j)$ be the distance between two keywords $k_i$ and $k_j$ in the same document.
• If $k_i$ and $k_j$ are in distinct documents we take $r(k_i, k_j) = \infty$.
• A local stem-stem metric correlation matrix $\vec{s}$ is defined as follows: each element $s_{u,v}$ of $\vec{s}$ expresses a metric correlation $c_{u,v}$ between the stems $s_u$ and $s_v$:
$c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}$
Metric Clusters (cont’d)

• Given a local metric matrix $\vec{s}$, to build local metric clusters:
– Consider the u-th row in the metric matrix
– Let $S_u(n)$ be a function which takes the u-th row and returns the set of n largest values $s_{u,v}$, where v varies over the set of local stems and $v \neq u$
– Then $S_u(n)$ defines a local metric cluster around the stem $s_u$.
Scalar Clusters

• Two stems with similar neighborhoods have some synonymity relationship.
• The way to quantify such neighborhood relationships is to arrange all correlation values $s_{u,i}$ in a vector $\vec{s}_u$, to arrange all correlation values $s_{v,i}$ in another vector $\vec{s}_v$, and to compare these vectors through a scalar measure.
Scalar Clusters (cont’d)

• Let $\vec{s}_u = (s_{u,1}, s_{u,2}, \ldots, s_{u,n})$ and $\vec{s}_v = (s_{v,1}, s_{v,2}, \ldots, s_{v,n})$ be two vectors of correlation values for the stems $s_u$ and $s_v$.
• Let $\bar{s} = (\bar{s}_{u,v})$ be a scalar association matrix. Each $\bar{s}_{u,v}$ can be defined as the cosine:
$\bar{s}_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$
• Let $S_u(n)$ be a function which returns the set of n largest values $\bar{s}_{u,v}$, $v \neq u$. Then $S_u(n)$ defines a scalar cluster around the stem $s_u$.
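The scalar cluster construction above can be sketched as follows; the correlation matrix is assumed to be a list of row vectors, and names are illustrative:

```python
import math

def scalar_correlation(row_u, row_v):
    """Cosine between two rows of the stem-stem correlation matrix."""
    dot = sum(a * b for a, b in zip(row_u, row_v))
    norm = math.sqrt(sum(a * a for a in row_u)) * math.sqrt(sum(b * b for b in row_v))
    return dot / norm if norm else 0.0

def scalar_cluster(matrix, u, n):
    """S_u(n): indices of the n stems v != u whose correlation rows
    are most cosine-similar to the row of stem u."""
    sims = [
        (scalar_correlation(matrix[u], matrix[v]), v)
        for v in range(len(matrix)) if v != u
    ]
    return [v for _, v in sorted(sims, reverse=True)[:n]]
```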
Interactive Search Formulation

• Stems (or terms) that belong to clusters associated with the query stems (or terms) can be used to expand the original query.
• A stem $s_u$ which belongs to a cluster (of size n) associated with another stem $s_v$ (i.e. $s_u \in S_v(n)$) is said to be a neighbor of $s_v$.
Interactive Search Formulation (cont’d)

• [Figure: the stem $s_u$ as a neighbor of the stem $s_v$, both inside the cluster $S_v(n)$]
Interactive Search Formulation (cont’d)

• For each stem $s_v$, select m neighbor stems from the cluster $S_v(n)$ (which might be of type association, metric, or scalar) and add them to the query.
• Hopefully, the additional neighbor stems will retrieve new relevant documents.
• $S_v(n)$ may be composed of stems obtained using both normalized and unnormalized correlation factors:
– a normalized cluster tends to group stems which are more rare.
– an unnormalized cluster tends to group stems due to their large frequencies.
Interactive Search Formulation (cont’d)

• Information about correlated stems can be used to improve the search:
– Let two stems $s_u$ and $s_v$ be correlated with a correlation factor $c_{u,v}$.
– If $c_{u,v}$ is larger than a predefined threshold, then a neighbor stem of $s_u$ can also be interpreted as a neighbor stem of $s_v$, and vice versa.
– This provides greater flexibility, particularly with Boolean queries.
– Consider the expression $(s_u + s_v)$, where the + symbol stands for disjunction.
– Let $s_u'$ be a neighbor stem of $s_u$.
– Then one can try both $(s_u' + s_v)$ and $(s_u' + s_u)$ as synonym search expressions, because of the correlation given by $c_{u,v}$.
Query Expansion Through Local Context Analysis

• The local context analysis procedure operates in three steps:
– 1. Retrieve the top n ranked passages using the original query.
This is accomplished by breaking up the documents initially retrieved by the query into fixed-length passages (for instance, of size 300 words) and ranking these passages as if they were documents.
– 2. For each concept c in the top ranked passages, the similarity sim(q, c) between the whole query q (not individual query terms) and the concept c is computed using a variant of tf-idf ranking.
– 3. The top m ranked concepts (according to sim(q, c)) are added to the original query q.
To each added concept is assigned a weight given by $1 - 0.9 \times i/m$, where i is the position of the concept in the final concept ranking.
The terms in the original query q might be stressed by assigning a weight equal to 2 to each of them.
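The weight assignment in step 3 can be sketched directly; the function name and dict representation are illustrative:

```python
def expanded_query_weights(query_terms, ranked_concepts, m):
    """Weights for the expanded query in local context analysis:
    original query terms get weight 2; the i-th added concept
    (1-based position in the concept ranking) gets 1 - 0.9 * i / m."""
    weights = {term: 2.0 for term in query_terms}
    for i, concept in enumerate(ranked_concepts[:m], start=1):
        # setdefault: an added concept never overrides an original query term
        weights.setdefault(concept, 1 - 0.9 * i / m)
    return weights
```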
