
Information retrieval (IR): traditional model
1. Why? Rationale for the module. Definition of IR
2. System & user components
3. Exact match & best match searches
4. Strengths & weaknesses

Tefko Saracevic

1. Why? Rationale for the module. Definition of IR
includes problems addressed in IR


Why?
Every online database, every
search engine, everything that is
searched online is based in some
way or another on principles
developed in IR
IR is at the heart of searching used in
systems such as DIALOG, LexisNexis
& others

Understanding the basics of IR is a prerequisite for understanding how searching of online systems works.


You are asking:
What basic elements and processes are involved in IR?
What are the conceptual bases for searching?
How are these applied in practice?


IR: original definition
Information retrieval embraces the
intellectual aspects of the
description of information and its
specification for search, and also
whatever systems, techniques, or
machines are employed to carry
out the operation.
Calvin Mooers, 1951


IR:
Objective & problems
Provide the users with effective
access to & interaction with
information resources.
Problems addressed:
1. How to organize information
intellectually?
2. How to specify search &
interaction intellectually?
3. What systems & techniques to
use for those processes?
Where do you fit?
With what problems do you deal?

2. System & user components
Traditional IR model presented


IR models
A model depicts, represents what is involved:
a choice of features, processes, things for consideration

Several IR models used over time:
traditional: oldest, most used, shows basic elements involved
treated in this module
interactive: more realistic, favored now, shows also interactions involved
treated in next module (module 5)

Each has strengths, weaknesses


Description of
traditional IR model
It has two streams of activities
one is the system side, with processes performed by the system
the other is the user side, with processes performed by users & intermediaries (you)
these two sides led to system orientation & user orientation
on the system side automatic processing is done; on the user side human processing is done

They meet at the matching process
where the query is fed into the system and the system looks for documents that match the query

Also feedback is involved so that things change based on results
e.g. query is modified & new matching done


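Traditional IR model
[Diagram: two streams that meet at matching]
System: Acquisition (documents, objects) -> Representation (indexing, ...) -> File organization (indexed documents) -> Matching
User: Problem (information need) -> Representation (question) -> Query (search formulation) -> Matching (searching)
Matching -> Retrieved objects
Feedback: results may lead to changes (e.g. a modified query) and new matching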

Acquisition
(system)
Content: What is in files, resources
in DIALOG, first part of blue sheets: File Description, Subject Coverage

Selection of documents & other objects from various sources
in blue sheets: Sources

Mostly text-based documents
full texts, titles, abstracts ...
but also other objects:
data, statistics, images, maps, trade marks, sounds ...

Importance:
Determines contents: what is in it
Key to file, resource selection!


Representation
of documents, objects
(system)
Indexing, many ways:
free text terms (even in full texts)
controlled vocabulary - thesaurus
manual & automatic techniques

Abstracting; summarizing
Bibliographic description:
author, title, sources, date
metadata

Classifying, clustering
Organizing in fields & limits
in DIALOG: Basic Index, Additional Index, Limits

Basic to what is available for searching & displaying

File organization
(system)
Sequential
record (document) by record

Inverted
term by term; list of records under
each term

Combination: indexes inverted, documents sequential
When only citations are retrieved, there is a need for document files
Large file approaches
for efficient retrieval by computers

Enables searching & interplay between types of files
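
The difference between a sequential and an inverted file can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three records, the whitespace tokenization and the function name `build_inverted_file` are made up for the example and do not describe how DIALOG or any other system stores its files.

```python
from collections import defaultdict

# Hypothetical sequential file: record (document) by record.
records = {
    1: "digital library services",
    2: "library catalogs",
    3: "digital image collections",
}

def build_inverted_file(sequential_file):
    """Build an inverted file: term by term, with a list of records under each term."""
    postings = defaultdict(set)
    for record_id, text in sequential_file.items():
        for term in text.lower().split():
            postings[term].add(record_id)
    # Sorted lists make the record numbers under each term easy to read and compare.
    return {term: sorted(ids) for term, ids in postings.items()}

inverted = build_inverted_file(records)
print(inverted["digital"])   # [1, 3]
print(inverted["library"])   # [1, 2]

# Combination: look up the term in the inverted index,
# then fetch the full records from the sequential file.
for record_id in inverted["digital"]:
    print(record_id, records[record_id])
```

Looking up a term in the inverted file avoids scanning every record, which is why this organization suits large files.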

Problem
(user)
Related to user's task, situation
varies in specificity, clarity

Produces information need
ultimate criterion for effectiveness of retrieval
how well was the need met?

Information need for the same problem may change, evolve, shift during the IR process - adjustment in searching
often more than one search for the same problem over time
you will experience this in your term project

Critical for examination in interview

Representation - question
(user & possibly system)
Non-mediated: end user alone
Mediated: intermediary + user
interviews; human-human interaction

Question analysis
selection, elaboration of terms
various tools may be used
thesaurus, classification schemes,
dictionaries, textbooks, catalogs

Focus toward
deriving search terms & logic
selection of files, resources

Subject to feedback changes

Critical roles of intermediary - you

Determines search specification - a dynamic process

Query - search statement
(user & system)
Translation into systems requirements & limits
start of human-computer interaction
query is the thing that goes into the computer

Selection of files, resources

Search strategy - selection of:
search terms & logic
possible fields, delimiters
controlled & uncontrolled vocabulary
variations in effectiveness tactics

Reiterations from feedback
several feedback types: relevance feedback, magnitude feedback ...
query expansion & modification

What & how of actual searching



Clarifying difference
Question is what the user asks and what you may then have elaborated
Query is what is asked of the computer to match what is put in
Question is transformed into query
Question:
I am interested in major historical developments in the area of information retrieval.

Query:
history information retrieval (in Google)
history AND information(w)retrieval (in DIALOG) (plus you have to select which file(s) to search)


Matching - searching
(user & system)
Process of matching, comparing
search: what documents in the file match the query as stated?

Various search algorithms:
exact match - Boolean
still available in most, if not all systems
best match - ranking by relevance
increasingly used, e.g. on the web
hybrids incorporating both
e.g. Target, Rank in DIALOG

Each has strengths, weaknesses
no perfect method exists and probably never will

Involves many types of search interactions & formulations

Retrieved documents
(from system to user)
Various orders of output:
Last In First Out (LIFO); sorted
ranked by relevance
ranked by other characteristics

Various forms of output
In DIALOG: Output options

When citations only: possible links to document delivery
Base for relevance, utility evaluation by users
Relevance feedback

What a user (or you) sees, gets, judges can be specified

3. Exact match & best match searches
Getting to that Boolean and similar stuff: the nitty-gritty of matching, which actually affects how you formulate the query


Exact match Boolean search
You retrieve exactly what you ask for in the query:
all documents that have the term(s) with logical connection(s), and possible other restrictions (e.g. to be in titles) as stated in the query
exactly: nothing less, nothing more

Based on matching following rules of Boolean algebra, or algebra of sets
new algebra
presented by circles in Venn diagrams


Boolean algebra

Operates on sets
e.g. set of documents

Has four operations (like in algebra):
1. A: retrieve set A
I want documents that have the term library

2. A AND B: retrieve set that has A and B
often called intersection & labeled A ∩ B
I want documents that have both terms library and digital someplace within

3. A OR B: retrieve set that has either A or B
often called union and labeled A ∪ B
I want documents that have either term library or term digital someplace within

4. A NOT B: retrieve set A but not B
often called negation and labeled A − B (set difference)
I want documents that have term library but if they also have term digital I do not want those
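
The four operations map directly onto Python's built-in set type, which may help in seeing what each one returns. The document numbers below are hypothetical; the sketch assumes each term has already been resolved to the set of documents that contain it.

```python
# Hypothetical sets: numbers of the documents that contain each term.
library = {1, 2, 4}   # A
digital = {1, 3, 4}   # B

a_alone = library              # 1. A: documents that have "library"
a_and_b = library & digital    # 2. A AND B: intersection
a_or_b = library | digital     # 3. A OR B: union
a_not_b = library - digital    # 4. A NOT B: in A but not in B

print(sorted(a_alone))   # [1, 2, 4]
print(sorted(a_and_b))   # [1, 4]
print(sorted(a_or_b))    # [1, 2, 3, 4]
print(sorted(a_not_b))   # [2]
```

Note that A NOT B depends on order: `digital - library` would give document 3 instead of document 2.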


Potential problems
But beware:
digital AND library will retrieve documents that have digital library (together as a phrase), but also documents that have digital in the first paragraph and library in the third section, 5 pages later, and that do not deal with digital libraries at all
thus in Google you will ask for "digital library" and in DIALOG for digital(w)library to retrieve the exact phrase digital library
digital NOT library will retrieve documents that have digital and suppress those that along with digital also have library, but sometimes those suppressed may very well be relevant. Thus, NOT is also known as the dangerous operator
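
The difference between AND and an exact-phrase search can be shown with a small sketch. The two documents and the helper names `has_all_terms` and `has_phrase` are invented for illustration; the adjacency check only approximates the idea behind a phrase search and is not a description of how Google or DIALOG implement it.

```python
# Hypothetical documents: one is about digital libraries, one is not.
docs = {
    1: "the digital library at the university opened in 1998",
    2: "digital cameras are discussed first and much later the town library budget",
}

def has_all_terms(text, terms):
    """AND: every term occurs somewhere in the text, in any position."""
    words = text.lower().split()
    return all(term in words for term in terms)

def has_phrase(text, phrase_terms):
    """Phrase: the terms occur next to each other, in the given order."""
    words = text.lower().split()
    n = len(phrase_terms)
    return any(words[i:i + n] == phrase_terms for i in range(len(words) - n + 1))

terms = ["digital", "library"]
and_hits = [doc_id for doc_id, text in docs.items() if has_all_terms(text, terms)]
phrase_hits = [doc_id for doc_id, text in docs.items() if has_phrase(text, terms)]

print(and_hits)     # [1, 2]
print(phrase_hits)  # [1]
```

Document 2 matches the AND search even though it has nothing to do with digital libraries; only the phrase search leaves it out.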


Boolean algebra depicted in Venn diagrams
Four basic operations:
e.g. A = digital, B = libraries
[Venn diagram: two overlapping circles A and B; region 1 = A only, region 2 = overlap of A and B, region 3 = B only]
A alone. All documents that have A. Shade 1 & 2. digital
A AND B. Shade 2. digital AND libraries
A OR B. Shade 1, 2, 3. digital OR libraries
A NOT B. Shade 1. digital NOT libraries

Venn diagrams cont.
Complex statements allowed, e.g.
[Venn diagram: three overlapping circles A, B and C with numbered regions]
(A OR B) AND C
Shade 4, 5, 6
(digital OR libraries) AND Rutgers

(A OR B) NOT C
Shade what?
(digital OR libraries) NOT Rutgers


Venn diagrams cont.
Complex statements can be made
as in ordinary algebra, e.g. (2+3)x4

As in ordinary algebra: watch for parentheses:
2+(3 x 4)
is not the same as
(2+3)x4
(A AND B) OR C
is not the same as
A AND (B OR C)
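
A quick check makes the effect of parentheses concrete, both for the arithmetic and for the Boolean statements. The sets A, B and C below hold hypothetical document numbers chosen only so that the two groupings give different results.

```python
# Ordinary algebra: the grouping changes the result.
print(2 + (3 * 4))   # 14
print((2 + 3) * 4)   # 20

# Boolean algebra on hypothetical sets of document numbers.
A = {1, 2, 3}
B = {2, 3, 4}
C = {3, 5}

left = (A & B) | C    # (A AND B) OR C
right = A & (B | C)   # A AND (B OR C)

print(sorted(left))    # [2, 3, 5]
print(sorted(right))   # [2, 3]
```

Document 5 is retrieved by the first statement only, because it is in C but not in A.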


Best match searching
Output is ranked
it is NOT presented as a Boolean set but in some rank order

You retrieve documents ranked by how similar (close) they are to a query (as calculated by the system)
similarity assumed as relevance
ranked from highest to lowest relevance to the query
mind you, as considered by the system
you change the query, system changes rank

thus, documents as answers are presented from those that are most likely relevant downwards to less & less likely relevant
can be cut at any desired number - e.g. first 10


Best match (cont.)
Best match process deals with PROBABILITY:
compares the set of query terms with the sets of terms in documents
calculates a similarity between query & each document based on common terms &/or other aspects
sorts the documents in order of similarity
assumes that the higher ranked documents have a higher probability of being relevant
allows for cut-off at a chosen number

BIG issue: What representation & similarity measures are better?
"better" determined by a number of criteria, e.g. relevance, speed
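
The steps of the best match process can be sketched as follows. The similarity measure used here is the Jaccard coefficient (common terms divided by all distinct terms), picked only because it is simple; it is one of many possible measures and not the one any particular system is known to use. The documents and the function names are likewise assumptions for the example.

```python
def jaccard(query_terms, doc_terms):
    """Similarity: number of common terms divided by number of distinct terms overall."""
    union = query_terms | doc_terms
    return len(query_terms & doc_terms) / len(union) if union else 0.0

def best_match(query, documents, cutoff):
    """Compare term sets, calculate similarity, sort, and cut off at a chosen number."""
    query_terms = set(query.lower().split())
    scored = [
        (jaccard(query_terms, set(text.lower().split())), doc_id)
        for doc_id, text in documents.items()
    ]
    # Highest similarity first; ties are broken by document number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, round(score, 2)) for score, doc_id in scored[:cutoff]]

# Hypothetical documents.
documents = {
    1: "digital library services",
    2: "public library budget report",
    3: "digital image archive",
}

ranked = best_match("digital library", documents, cutoff=2)
print(ranked)   # [(1, 0.67), (3, 0.25)]
```

Document 2 also shares a term with the query, but it falls below the cut-off. Choosing a different similarity measure could change the order, which is the "BIG issue" raised above.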


Best match (cont.)
Variety of algorithms (formulas) used to determine similarity
using statistical &/or linguistic properties
e.g. if digital appears a lot in a given document relative to its size, that document will be ranked higher when the query is digital

many proposed & tested in IR research
many developed by commercial organizations
Google also uses calculations as to number of links to/from a document
many algorithms are now proprietary

system ranking and your ranking may not necessarily be in agreement

Web outputs are mostly ranked
But DIALOG allows ranking as well, with special commands
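
The statistical example above (a term appearing a lot relative to document size) can be illustrated with a relative term frequency. This is a deliberately simplified sketch: the documents and the function name `relative_frequency` are made up, and real ranking algorithms combine many more properties, most of them proprietary.

```python
def relative_frequency(term, text):
    """Occurrences of the term divided by the number of words in the document."""
    words = text.lower().split()
    return words.count(term.lower()) / len(words) if words else 0.0

# Hypothetical documents of different sizes.
documents = {
    "short": "digital archives and digital maps",
    "long": "a long report that mentions digital once among many other words",
    "none": "library catalogs",
}

# Rank documents for the query "digital", highest relative frequency first.
ranking = sorted(
    documents,
    key=lambda doc_id: relative_frequency("digital", documents[doc_id]),
    reverse=True,
)

print(ranking)                                             # ['short', 'long', 'none']
print(relative_frequency("digital", documents["short"]))   # 0.4
```

The short document ranks first because digital makes up 2 of its 5 words, against 1 of 11 in the long one.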

4. Strengths &
weaknesses


Boolean vs. best match

Boolean
allows for logic
provides all that has been matched
BUT
has no particular order of output
treats all retrievals equally - from the most to least relevant ones
often requires examination of large outputs

Best match
allows for free terminology
provides for a ranked output
provides for cut-off - any size output
BUT
does not include logic
ranking method (algorithm) not transparent
whose relevance?
where to cut off?

Strengths of traditional IR model
Lists major components in both system & user branches
Suggests:
What to explain to users about system, if needed
What to ask of users for more effective searching (problem ...)

Selection of component(s) for concentration
mostly ever better representation

Provides a framework for evaluation of (static) aspects


Weaknesses
Does not address nor account for
interaction & judgment of results
by users
identifies interaction with search only
interaction is a much richer process

Many types of & variables in interaction not reflected
Feedback has many types & functions - also not shown
Evaluation thus one-sided

IR is a highly interactive process - thus additional model(s) needed

Interactive models
Explored in next module
Module 5
