CSA3080: Adaptive Hypertext Systems I

Lecture 5:
Information Retrieval I
Dr. Christopher Staff

Department of Computer Science & AI
University of Malta
© 2003- Chris Staff cstaff@cs.um.edu.mt University of Malta
Aims and Objectives
• Aims and objectives of IR
• Boolean, Extended Boolean, Statistical Models
Aims and Objectives
• You should end up knowing the major
differences between the simple matching
algorithms
• And what each algorithm considers to be a
relevant document…
• Bear in mind that we will use IR in AHS to find
information relevant to our user so that we can
present it/lead the user to it…
Aims and Objectives of IR
• To facilitate the identification and retrieval of
documents that contain information relevant to
an information need expressed by a user
• We are particularly interested in the retrieval of
information from unstructured data
Boolean Information Retrieval
• Developed in the 1950s
• A document is represented by a collection of
terms that occur in the document (index)
• The set of unique terms occurring in the collection is
called the vocabulary
• A document is represented by a bit sequence
with a 1 representing a term that is present, and
0 otherwise
• How is the query expressed?
– User thinks of terms that describe an information
need
– Formalises query as a boolean expression
– (Term27 OR Term46) NOT (Term30 AND Term16)
• How does the matching algorithm work?
– Each term in the vocabulary has a set (or postings
list) of documents that contain the term
– For each term in the query, the postings lists are
retrieved
– Set operations (union/intersection/difference) combine
them as the query operators (OR/AND/NOT) dictate
– All documents in the results set are returned
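The representation and matching steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the toy collection, the whitespace tokeniser, and the helper names `build_postings`, `bit_vector` and `p` are all assumptions made for the example, and the query terms stand in for the slide's Term27, Term46, Term30 and Term16.

```python
from collections import defaultdict

# Hypothetical toy collection; IDs and texts are illustrative only.
DOCS = {
    1: "information retrieval from unstructured data",
    2: "boolean retrieval uses set operations",
    3: "structured data in a relational database",
    4: "boolean queries over unstructured data",
}


def build_postings(docs):
    """Map each vocabulary term to the set of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokeniser (assumption)
            postings[term].add(doc_id)
    return dict(postings)


def bit_vector(doc_id, postings, vocabulary):
    """Represent one document as a 0/1 sequence over the sorted vocabulary."""
    return [1 if doc_id in postings[term] else 0 for term in vocabulary]


postings = build_postings(DOCS)
vocabulary = sorted(postings)  # the unique terms of the collection


def p(term):
    """Postings set for a term; empty if the term is not in the vocabulary."""
    return postings.get(term, set())


# (boolean OR unstructured) NOT (retrieval AND information)
# OR -> union, AND -> intersection, NOT -> set difference
result = (p("boolean") | p("unstructured")) - (p("retrieval") & p("information"))
print(sorted(result))
```

Every document in `result` is returned with equal standing: the set operations say only whether a document matches, not how well.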
Questions Arising…
• Is this really information retrieval?
– Just because a document contains term x, does it
mean that the document is about term x?
• What about concepts?
– What makes it possible for us to know that a fish
cake is not a dessert? That “she is the apple of my
eye” does not make her a piece of fruit?
• Can we rank the results of a boolean query?
– All we are doing is checking the presence and
absence of terms
– On what grounds would we rank?
• And doesn’t it look suspiciously like
RDBMS/SQL???
Does Boolean IR work?
• BIR works, and works well, when the vocabulary is
reasonably small…
• … when there is no ambiguity in the meaning of terms
• … when the presence of a term in a document is
significant
• … when the absence of a term from a document means
that the document cannot be about that term
Does Boolean IR work?
• Boolean IR is typically applied to a document
surrogate
• And is used with tremendous success in
RDBMS
• Most general purpose IR systems in use on the
Internet are derived from BIR with some
extensions…
Vector Space Model of IR
• Briefly…
– Documents (query) represented by vector of term
weights
– Term weight describes relative importance of term
to document (query)
– Similarity of document to query measured
– The more similar the document to the query, the
more relevant it is
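As a rough sketch of the idea, the snippet below weights terms and ranks documents by similarity to the query. The slide does not fix a weighting scheme or a similarity measure, so the use of tf-idf weights and cosine similarity here is an assumption (a common choice), as are the toy collection and the function names.

```python
import math
from collections import Counter

# Hypothetical toy collection; IDs and texts are illustrative only.
DOCS = {
    1: "information retrieval from unstructured data",
    2: "boolean retrieval uses set operations",
    3: "structured data in a relational database",
}


def tf_idf_vectors(docs):
    """Return ({doc_id: {term: weight}}, idf) using tf * log(N / df) weights."""
    tokenised = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    df = Counter(t for tokens in tokenised.values() for t in set(tokens))
    idf = {t: math.log(n / count) for t, count in df.items()}
    vectors = {
        d: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for d, tokens in tokenised.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs, limit=None):
    """Return (doc_id, score) pairs, most similar first, dropping zero scores."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms outside the vocabulary get weight 0 (assumption).
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scored = sorted(((cosine(q, vec), d) for d, vec in vectors.items()), reverse=True)
    return [(d, round(s, 3)) for s, d in scored[:limit] if s > 0.0]


print(rank("unstructured information retrieval", DOCS))
```

Unlike the Boolean case, document 2 is still returned although it contains only one of the three query terms; it simply ranks below document 1.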
Vector Space Model of IR
• VSM gives improved results over Boolean
• Can rank documents
• Can control output (limit the no. of documents
returned)

• But… not as easy to construct query
– Query does not contain any structure
– Can’t express synonymy, etc.
Extended Boolean Retrieval Model
• Developed to address ranking problem in BIR, using
VSM-like approach, while retaining Boolean query
structures
• E-BIR not as strict as BIR (fuzzy matches supported, as
in VSM)
• Term features can include frequency, location, …
• Reference:
– G. Salton, E. A. Fox, and H. Wu. (1983). Extended Boolean information retrieval.
Communications of the ACM, 26(11):1022-1036.
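One way to make the "not as strict" matching concrete is the p-norm scoring introduced in the Salton, Fox and Wu paper. The sketch below is a simplified reading of it: all query terms are assumed to have weight 1, document term weights are assumed to be normalised to the range 0 to 1, and the function names are hypothetical.

```python
def pnorm_or(weights, p=2.0):
    """Score for (t1 OR t2 OR ...): high if any term weight is high."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)


def pnorm_and(weights, p=2.0):
    """Score for (t1 AND t2 AND ...): penalises every term that is missing."""
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / len(weights)) ** (1.0 / p)


# Document weights for the two query terms "computer" and "science"
# (illustrative values, assumed to lie in [0, 1]).
docs = {
    "both terms": [1.0, 1.0],
    "one term": [1.0, 0.0],
    "neither term": [0.0, 0.0],
}

for name, weights in docs.items():
    print(name, round(pnorm_and(weights), 3), round(pnorm_or(weights), 3))
```

A strict Boolean AND would reject the "one term" document outright; here it receives a partial score and is ranked below the document containing both terms. With p = 1 both operators reduce to the plain average of the weights, and as p grows the scores approach the strict Boolean behaviour (minimum for AND, maximum for OR).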
• Matching is still based on presence or absence
of terms, but now results can be ranked
• Terms in docs and query are weighted
according to term features
• With structured documents (e.g., HTML), term
features can also include structural information
(title, heading, style, …)
• With location information it is possible to find
terms NEAR each other
– “computer NEAR science” not the same as
“computer AND science”
– ADJ (adjacent) refines the proximity measure
• Ranked results are an improvement
• NEAR is also useful to improve the quality of
results
• … as is ADJ
• Are we any closer to information retrieval?
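The difference between AND, NEAR and ADJ can be sketched with a positional index, which stores where each term occurs as well as which documents contain it. The exact semantics of NEAR and ADJ vary between systems; here NEAR is assumed to mean "within a given number of word positions, in either order" and ADJ "immediately followed by". The collection and function names are illustrative.

```python
from collections import defaultdict

# Hypothetical toy collection; IDs and texts are illustrative only.
DOCS = {
    1: "computer science is the study of computation",
    2: "the science museum bought a new computer",
    3: "science of computer design",
}


def build_positional_index(docs):
    """Map term -> {doc_id: [word positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def both(index, a, b):
    """Plain Boolean AND: documents containing both terms anywhere."""
    return set(index.get(a, {}).keys() & index.get(b, {}).keys())


def near(index, a, b, window):
    """Documents where a and b occur within `window` positions, either order."""
    return {
        d for d in both(index, a, b)
        if any(abs(i - j) <= window for i in index[a][d] for j in index[b][d])
    }


def adj(index, a, b):
    """Documents where b occurs immediately after a."""
    return {
        d for d in both(index, a, b)
        if any(j - i == 1 for i in index[a][d] for j in index[b][d])
    }


index = build_positional_index(DOCS)
print(sorted(both(index, "computer", "science")))     # AND
print(sorted(near(index, "computer", "science", 3)))  # NEAR, window of 3
print(sorted(adj(index, "computer", "science")))      # ADJ
```

Each operator returns a subset of the previous one: document 2 mentions both words but far apart, and only document 1 contains them as the adjacent pair "computer science".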
Phrase Matching
• Concepts may be evidenced in text as
complex/compound identifiers
– New York, Computer Science, information retrieval,
database management systems, …
• Brings us closer to information retrieval, but still only
identifies documents that contain phrases
• Reference:
– W. B. Croft, H. R. Turtle, and D. D. Lewis. (1991). The use of phrases and
structured queries in information retrieval. ACM SIGIR, 32-45.
Phrase Matching
• Extended/Boolean can express phrases using
AND together with proximity operator
• VSM cannot, unless the phrase has been
indexed!
• When is a sequence of words a phrase?
– Croft et al. use a probabilistic inference net
model…
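To illustrate what "the phrase has been indexed" could mean for a term-weight model such as VSM, the sketch below merges known phrases into single index terms at tokenisation time. It deliberately sidesteps the hard question of deciding when a word sequence is a phrase: the phrase list `PHRASES` is simply assumed to be given, and the function name and the underscore convention are hypothetical.

```python
# Hypothetical list of known phrases, each stored as a tuple of lowercase words.
PHRASES = {
    ("new", "york"),
    ("computer", "science"),
    ("information", "retrieval"),
    ("database", "management", "systems"),
}


def tokenise_with_phrases(text, phrases=PHRASES):
    """Split text into terms, emitting each known phrase as one term."""
    words = text.lower().split()
    longest = max((len(ph) for ph in phrases), default=1)
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate first so longer phrases win over shorter ones.
        for n in range(min(longest, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in phrases:
                tokens.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the single word as a term.
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenise_with_phrases("New York database management systems course"))
```

Once documents and queries are tokenised this way, a phrase such as "new_york" is an ordinary vocabulary term and can be weighted and matched like any other, at the cost of having to know the phrases at indexing time.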
Conclusion
• The Boolean and Extended Boolean Models give us a
simple mechanism for representing documents
• If we can represent a user’s interest by the presence or
absence of terms, then the user model could be used as
a query to locate interesting documents
• Phrase matching allows us to recognise complex
nouns: useful only if phrase is pervasive
