
Retrieval Models I

Boolean, Vector Space, Probabilistic


What is a retrieval model?
• An idealization or abstraction of an actual process (retrieval)
– results in a measure of similarity between query and document

• May describe the computational process
– e.g. how documents are ranked
– note that an inverted file is an implementation, not a model

• May attempt to describe the human process
– e.g. the information need, search strategy, etc.

• Retrieval variables:
– queries, documents, terms, relevance judgements, users, information needs

• Have an explicit or implicit definition of relevance

 Allan, Ballesteros, Croft, and/or Turtle


Mathematical models
• Study the properties of the process
• Draw conclusions or make predictions
– Conclusions derived depend on whether the model is a good approximation of the actual situation

• Statistical models represent repetitive processes
– predict frequencies of interesting events
– use probability as the fundamental tool



Exact Match Retrieval
• Retrieving documents that satisfy a Boolean expression
constitutes the Boolean exact match retrieval model
– query specifies precise retrieval criteria
– every document either matches or fails to match query
– result is a set of documents (no order)

• Advantages:
– efficient
– predictable, easy to explain
– structured queries
– works well when you know exactly which documents you want
Exact-match Retrieval Model
• Disadvantages:
– query formulation difficult for most users
– difficulty increases with collection size (why?)
– indexing vocabulary same as query vocabulary
– acceptable precision generally means unacceptable recall
– ranking models are consistently better

• Best-match or ranking models are now more common
– we’ll see these later
Boolean retrieval
• Most common exact-match model
– queries: logic expressions with doc features as operands
– retrieved documents are generally not ranked
– query formulation difficult for novice users

• Boolean queries
– Used by Boolean retrieval model and in other models
– Boolean query ≠ Boolean model

• “Pure” Boolean operators: AND, OR, and NOT
• Most systems have proximity operators
• Most systems support simple regular expressions as search terms to match spelling variants
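
A minimal sketch of exact-match Boolean evaluation over an inverted index, using set operations on toy posting lists (the index contents and document ids below are invented for illustration):

```python
# Exact-match Boolean retrieval as set operations over posting lists.
# The tiny index and document ids are invented for illustration.
postings = {
    "retrieval": {1, 2, 4},
    "boolean":   {2, 3},
    "ranking":   {1, 4},
}
all_docs = {1, 2, 3, 4}

def AND(a, b): return a & b          # both operands must match
def OR(a, b):  return a | b          # either operand may match
def NOT(a):    return all_docs - a   # documents that do not match

# (retrieval AND boolean) OR (ranking AND (NOT boolean))
result = OR(AND(postings["retrieval"], postings["boolean"]),
            AND(postings["ranking"], NOT(postings["boolean"])))
print(sorted(result))  # [1, 2, 4] -- an unordered set, no ranking
```

Note that the result is just a set: every document either satisfies the expression or it doesn’t, which is exactly the exact-match behavior described above.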
WESTLAW: Large commercial system

• Serves legal and professional market
– legal materials (court opinions, statutes, regulations, …)
– news (newspapers, magazines, journals, …)
– financial (stock quotes, SEC materials, financial analyses, …)
• Total collection size: 5-7 Terabytes
• 700,000 users
– Claim 47% of legal searchers
– 120K daily on-line users (54% of on-line users)
• In operation since 1974 (51 countries as of 2002)
• Best-match (and free text queries) added in 1992
WESTLAW query language features
• Boolean operators

• Proximity operators (see the sketch after this list)
– Phrases - “West Publishing”
– Word proximity - West /5 Publishing
– Same sentence - Massachusetts /s technology
– Same paragraph - “information retrieval” /p exact-match

• Restrictions
– (e.g. DATE(AFTER 1992 & BEFORE 1995))
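
A rough sketch of how a WESTLAW-style “/n” word-proximity operator could be evaluated; this is a simplification over raw token positions (real systems use positional inverted indexes, and the sentence/paragraph operators /s and /p are not covered here):

```python
# Check whether two terms occur within n words of each other,
# scanning token positions in a single document (illustrative only).
def within(text, term_a, term_b, n):
    tokens = text.lower().split()
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

print(within("West Publishing serves the legal market",
             "west", "publishing", 5))   # True: adjacent
print(within("West later merged with Thomson to form a publishing group",
             "west", "publishing", 5))   # False: 8 words apart
```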



WESTLAW query language features
• Term expansion
– Universal characters (THOM*SON)
– Truncation (THOM!)
– Automatic expansion of plurals, possessives

• Document structure (fields)
– query expression restricted to named document component (title, abstract)
– title(“inference network”) & cites(Croft) & date(after 1990)



Features to Note about Queries
• Queries are developed incrementally
– add query components until a reasonable number of documents is retrieved
• “query by numbers”
– implicit relevance feedback
• Queries are complex
– proximity operators used very frequently
– implicit OR for synonyms
– NOT is rare
• Queries are long (av. 9-10 words)
– not typical Internet queries (1-2 words)
WESTLAW Query Processing
• Stop words
– indexing stopwords (one)
– query stopwords (about 50)
• stopwords retained in some contexts

• Thesaurus to aid user expansion of queries

• Document display
– sort order specified by user (e.g. date)
– query term highlighting



WESTLAW query example

• What is the statute of limitations in cases involving the Federal Tort Claims Act?
– LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

• Are there any cases which discuss negligent maintenance or failure to maintain aids to navigation such as lights, buoys, or channel markers?
– NOT NEGLECT! FAIL! NEGLIG! /5 MAINT! REPAIR! /P NAVIGAT! /5 AID EQUIP! LIGHT BUOY “CHANNEL MARKER”
Exact-Match VS Best-Match Retrieval
• Exact-match
– query specifies precise retrieval criteria
– every document either matches or fails to match query
– result is a set of documents
• Best-match
– Most common retrieval approach
– generally better retrieval performance
– query describes good or “best” matching document
– result is a ranked list of documents; good documents appear at the top of the ranking
– result may include estimate of quality
– hard to compare best- and exact-match in principled way
Best-Match Retrieval
• Advantages of Best Match:
– Significantly more effective than exact match

– Uncertainty is a better model than certainty

– Easier to use (support full text queries)

– Similar efficiency (based on inverted file implementations)



Best Match Retrieval
• Disadvantages:
– More difficult to convey an appropriate cognitive model (“control”)

– Full text is not natural language understanding (no “magic”)

– Efficiency is always less than exact match (cannot reject documents early)

• Boolean or structured queries can be part of a best-match retrieval model
Why did commercial services ignore best-match (for 20 years)?
• Inertia/installed software base
• “Cultural” differences
• Not clear tests on small collections would scale
• Operating costs
• Training
• Few convincing user studies
• Risk



Retrieval Models
• A retrieval model specifies the details of:
– Document representation
– Query representation
– Retrieval function

• Determines a notion of relevance
– either implicitly or explicitly

• Notion of relevance can be binary or continuous (i.e. ranked retrieval).



Classes of Retrieval Models
• Boolean models (set theoretic)
– Extended Boolean

• Vector space models (statistical/algebraic)
– Generalized VS
– Latent Semantic Indexing

• Probabilistic models



Other Model Dimensions
• Logical View of Documents
– Index terms
– Full text
– Full text + Structure (e.g. hypertext)

• User Task
– Retrieval
– Browsing



Retrieval Tasks

• Ad hoc retrieval: Fixed document corpus, varied queries.
• Filtering: Fixed query, continuous document stream.
– User Profile: A model of relatively static preferences.
– Binary decision of relevant/not-relevant.
• Routing: Same as filtering, but continuously supplies ranked lists rather than binary decisions.



Issues for Vector Space Model
• How to determine important words in a document?
– Word sense?
– Word n-grams (and phrases, idioms, …) → terms

• How to determine degree of importance of a term within a document and within the entire collection?

• How to determine the degree of similarity between a document and the query?

• In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?
The Vector-Space Model
• Assume t distinct terms remain after pre-processing (the index terms or the vocabulary).
• These “orthogonal” terms form a vector space: dimension = t = |vocabulary|.
• Each term i in a document or query j is given a real-valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
Document Collection
• Collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document;
– zero means the term has no significance in the document or it simply doesn’t exist in the document.

      T1   T2   …   Tt
D1    w11  w21  …   wt1
D2    w12  w22  …   wt2
:     :    :        :
Dn    w1n  w2n  …   wtn

• tf-idf weighting typical: wij = tfij · idfi = tfij · log2(N/dfi)
Graphic Representation

Example (three index terms T1, T2, T3 as the axes of the vector space):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q plotted as vectors along the T1, T2, T3 axes]

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.
fij = frequency of term i in document j

• May want to normalize term frequency (tf) within the document:
tfij = fij / maxi{fij}



Term Weights: IDF
• Terms that appear in many different documents are less indicative of overall topic.
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2(N/dfi)
(N: total number of documents)
• Recall: indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
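
A small sketch that puts tf and idf together using the slides’ formulas (tf normalized by the most frequent term in each document, idf = log2(N/df)); the toy documents are invented for illustration:

```python
import math
from collections import Counter

# tf-idf weighting per the slides: wij = tfij * log2(N / dfi),
# with tf normalized by the max term frequency in each document.
docs = [
    "information retrieval ranks documents",
    "boolean retrieval returns sets",
    "vector space retrieval ranks documents by similarity",
]
N = len(docs)
tokenized = [d.split() for d in docs]
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(tokens):
    counts = Counter(tokens)
    max_f = max(counts.values())
    return {t: (f / max_f) * math.log2(N / df[t]) for t, f in counts.items()}

for j, toks in enumerate(tokenized, 1):
    print(f"D{j}:", {t: round(w, 2) for t, w in tfidf(toks).items()})
```

Note that “retrieval”, which occurs in all three documents, gets idf = log2(3/3) = 0, illustrating how idf discounts terms with no discrimination power.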



Query Vector
• The query vector is typically treated as a document and also tf-idf weighted.

• An alternative is for the user to supply weights for the given query terms.



Similarity Measure
• A similarity measure is a function that computes the degree of similarity between two vectors.

• Using a similarity measure between the query and each document:
– It is possible to rank the retrieved documents in the order of presumed relevance.
– It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.



Similarity Measure - Inner Product
• Similarity between the vectors for document dj and query q can be computed as the vector inner product:

sim(dj, q) = dj · q = Σi=1..t (wij · wiq)

where wij is the weight of term i in document j and wiq is the weight of term i in the query.

• For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).

• For weighted term vectors, it is the sum of the products of the weights of the matched terms.
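
A minimal sketch of the inner product over sparse term-weight vectors (dictionaries mapping term to weight), reproducing the weighted example shown below:

```python
# Inner-product similarity: sum weight products over shared terms.
def inner_product(d, q):
    return sum(w * q.get(t, 0) for t, w in d.items())

D1 = {"T1": 2, "T2": 3, "T3": 5}
D2 = {"T1": 3, "T2": 7, "T3": 1}
Q  = {"T3": 2}                     # zero-weight query terms omitted

print(inner_product(D1, Q))  # 10 = 2*0 + 3*0 + 5*2
print(inner_product(D2, Q))  # 2  = 3*0 + 7*0 + 1*2
```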



Properties of Inner Product

• The inner product is unbounded.

• Favors long documents with a large number of unique terms.

• Measures how many terms matched, but not how many terms are not matched.



Inner Product -- Examples
Binary (vocabulary of 7 terms: retrieval, database, architecture, computer, text, management, information):
– D = (1, 1, 1, 0, 1, 1, 0)
– Q = (1, 0, 1, 0, 0, 1, 1)
(vector size = vocabulary size = 7; a 0 means the corresponding term is not found in the document or query)

sim(D, Q) = 3

Weighted:
D1 = 2T1 + 3T2 + 5T3    D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2



Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• It is the inner product normalized by the vector lengths:

CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σi=1..t (wij · wiq) / √(Σi=1..t wij² · Σi=1..t wiq²)

[Figure: D1, D2, and Q in the term space t1, t2, t3; the angle between each document vector and Q determines its similarity]

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)·(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)·(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.
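
A minimal sketch of cosine similarity over the same sparse vectors, reproducing the numbers above:

```python
import math

# Cosine similarity: inner product normalized by the vector lengths.
def cos_sim(d, q):
    dot = sum(w * q.get(t, 0) for t, w in d.items())
    norm = (math.sqrt(sum(w * w for w in d.values())) *
            math.sqrt(sum(w * w for w in q.values())))
    return dot / norm

D1 = {"T1": 2, "T2": 3, "T3": 5}
D2 = {"T1": 3, "T2": 7, "T3": 1}
Q  = {"T3": 2}

print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```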
Simple Implementation
1. Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.

2. Convert query to a tf-idf-weighted vector q.

3. For each dj in D do
Compute score sj = cosSim(dj, q)
4. Sort documents by decreasing score.

5. Present top ranked documents to the user.


Time complexity: O(|V|·|D|). Bad for large V and D!
e.g. |V| = 10,000 and |D| = 100,000 gives |V|·|D| = 1,000,000,000
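
A self-contained sketch of the five steps above, combining the earlier tf-idf and cosine sketches (toy documents and query invented for illustration):

```python
import math
from collections import Counter

# Steps 1-5: tf-idf weight all documents and the query, score each
# document by cosine similarity, sort, and present the ranking.
docs = [
    "boolean retrieval returns unordered sets",
    "vector space retrieval ranks documents",
    "probabilistic models estimate relevance",
]
N = len(docs)
tokenized = [d.split() for d in docs]
df = Counter(t for toks in tokenized for t in set(toks))

def weight(tokens):
    counts = Counter(tokens)
    max_f = max(counts.values())
    return {t: (f / max_f) * math.log2(N / df[t])
            for t, f in counts.items() if t in df}

def cos_sim(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = [weight(toks) for toks in tokenized]                # step 1
q = weight("retrieval ranks documents".split())               # step 2
scores = [(cos_sim(v, q), i) for i, v in enumerate(vectors)]  # step 3
for score, i in sorted(scores, reverse=True):                 # steps 4-5
    print(f"{score:.2f}  {docs[i]}")
```

The double loop over |D| documents and (up to) |V| terms is exactly where the O(|V|·|D|) cost above comes from; inverted indexes avoid touching documents that share no terms with the query.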



Comments on Vector Space Models

• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite obvious weaknesses.
• Allows efficient implementation for large document collections.
Problems with Vector Space Model
• Assumption of term independence.
• Missing semantic information (e.g. word sense).
• Missing syntactic information (e.g. phrase structure, word order, proximity information).
• Lacks the control of a Boolean model (e.g., requiring a term to appear in a document).
– Given a two-term query “A B”, it may prefer a document containing A frequently but not B over a document that contains both A and B, but each less frequently (see the toy demonstration below).
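
A toy demonstration of that last problem, assuming raw tf weights and inner-product scoring (both assumptions made just for this demo):

```python
# A document rich in A but lacking B can outscore one containing both.
def inner_product(d, q):
    return sum(w * q.get(t, 0) for t, w in d.items())

q      = {"A": 1, "B": 1}
doc_ab = {"A": 1, "B": 1}   # contains both query terms once
doc_a  = {"A": 10}          # contains only A, but frequently

print(inner_product(doc_ab, q))  # 2
print(inner_product(doc_a, q))   # 10 -> ranked above doc_ab
```

A Boolean AND would reject doc_a outright; the vector space model has no such hard constraint.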
