CS485
Tibebe Beshah
https://sites.google.com/site/sadcoursesite/
HiLCoE 1
What the course is about
Motivation (why)
How?
Lecture
Assignment
Project
Quizzes/tests
Class work (group)
What?
Course Outline
Information Storage and Retrieval (ISR): Basic concepts (one and a half weeks)
Content (subject) analysis and representation (two weeks)
Standardization and identifying term importance (one and a half weeks)
Models of modern IR systems (three weeks)
Users and interfaces of information retrieval systems (one week)
Evaluation of information retrieval systems (one and a half weeks)
Current areas of research in information retrieval (half a week)
Course Objective:
The general objective of the course is to let students understand the motivation, mechanisms, and potential of Information Storage and Retrieval. In line with this, at the end of the course students are expected to:
Know the basic theories and principles of knowledge organization and information retrieval
Understand the process of information storage and retrieval
Acquire the skills to study and analyze information retrieval systems designed using various models
Be familiar with evaluation issues in information retrieval
What the course emphasizes
Understanding of
Theories
What this course is NOT
An algorithm design course
We might use several related algorithms, but will not study them in detail
A system development course
Except that some assignments may require you to write or compile some procedures in C, C++, Java, etc.
We look at an IR system as a whole, not at its individual components
Knowledge Useful for the Course
Mathematics (set theory, probability, vector algebra)
Data / File structure
Linguistics (read papers on Linguistics in Information Science)
System Analysis & Design
Programming in higher level languages such as C, C++, Java, VB, etc.
What IR assumes?
References
Robert Korfhage (1997). Information Storage and Retrieval. John Wiley & Sons, Inc.
G.G. Chowdhury (1990). Introduction to Modern Information Retrieval. Library Association Publishing. London.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto (1999). Modern Information Retrieval. New York: ACM Press.
A.C. Foskett (1988). The Subject Approach to Information. 5th ed. London: Bingley.
Charles T. Meadow (2000). Text Information Retrieval Systems. 2nd ed. Academic Press. New York.
Evaluation
Information Storage and
Retrieval (ISR): Basic concepts
Chapter One
Objectives (Chapter One)
Topics ( Chapter one )
Definition, Foundation,
theories and principles
Information Retrieval Systems?
Document (Web page) retrieval in response to a query
Quite effective (at some things)
Commercially successful (some of them)
But what goes on behind the scenes?
How do they work?
What happens beyond the Web?
Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
Web Search System
[Figure: a Web spider gathers pages into a document corpus; the IR system takes a query string and the corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...)]
Information Retrieval - Definition
Cont…
The definition incorporates all important features of a
good information retrieval system
Representation
Storage
Organization
Access
Evaluation
Information Retrieval
Can be structured for ease of discussion as
Text IR
Entities in IRS
Thus the focus is on
Key Issues in IR
Organizing
How to describe information resources or information-bearing objects so that they may be effectively used by those who need them
Retrieving
How to find the appropriate information resources or information-bearing objects for someone’s (or your own) needs
Build a system that retrieves documents that users are likely to find relevant to their queries
This set of assumptions underlies the field of IR
IR is an Iterative Process
[Figure: information life cycle, with the phases Active, Semi-Active and Inactive. Stage labels: Creation, Authoring, Modifying, Organizing, Indexing, Storing, Retrieval, Accessing, Filtering, Distribution, Networking, Utilization, Using, Creating, Retention/Mining, Disposition, Discard, Searching]
IR-Representation/organizing
Cont…
An example of a user information need:
Find all passages (documents) containing information on college tennis teams which:
are maintained by a USA university
Cont…
Basic remarks on the user information need (in the context of the World Wide Web):
Such a full description of the user information need cannot be used directly to request information using the current interfaces of Web search engines
The user must first translate his information need into a
IR- Storing/Retrieving
Information storage
How and where is information stored?
Retrieving information
How is information recovered from storage?
How to find needed information
Linked with the accessing/filtering stage
IR - Accessing/Filtering
Implementation
More on IR
IR is concerned with the retrieval of relevant documents from a large collection of documents
Relevant documents are identified according to specific criteria (usually expressed as a query)
IR usually deals with natural language (NL) text, which is not always well structured and can be semantically ambiguous
Cont…
IR is an Iterative Process
[Figure: a sketch of a searcher moving through many actions (queries Q0 to Q5 issued against repositories) towards a general goal of satisfactory completion of research related to an information need]
Historical overview
Organization and storage of knowledge for ease of access is centuries old
That is, the history of recording knowledge goes back thousands of years
Important events
Development of writing, books and printing technology, news publishing, journal publishing (for economic reasons: books are not economical in terms of money and time), libraries (to put publications in one centre)
Cont…
Now: the World Wide Web
A gigantic distributed collection of heterogeneous information items (web pages)
New challenges
Cont…
Simple methods to facilitate access to a single document:
Table of contents
Keyword index
Classical methods to facilitate access to collections of documents:
Index (keywords, authors)
Hierarchical classification (Dewey Decimal Classification)
Creation of Disciplines
The mechanized era (Sparck Jones and Willett, 1997)
IR systems were mainly used by librarians for carrying out bibliographic searches in place of manual tools such as card catalogues and universal classification systems
The advent of word processing technology (software + hardware)
A rapid, widespread growth in the usage of IR
Increased interest in Web-based distributed information processing and in the application of IR techniques to non-textual information
The growth of knowledge led to the creation of disciplines
Cont…
Discipline oriented era
e.g. Science from philosophy
physics from science
electricity from physics
electronics from electricity
Similarly, information retrieval from the wider discipline of information science
Then came the problem oriented era
Disciplines are merged to form a new subject
E.g. molecular biology from physics and biology (Foskett, 1988)
Cont…
Such growth of knowledge gave birth to the creation of disciplines (domain knowledge), which then brought about the need for classification and indexing
Putting related knowledge together
E.g. Science, Arts, and Humanities
Creation of subclasses within classes
Designing ways and means of accessing information (which is the area of IR)
Information retrieval system:
components, structures and
functions
How do we characterize IRS?
What is a system?
Is a set of interrelated components interacting
together to achieve an objective.
Has basic characteristics like:
Input, output, environment, boundary, objectives,
components, interaction, interface
Can be living or non-living
What is “systems thinking”?
Do you agree with this? “A system is bigger than the sum
of its components”
Systems thinking
Is a mindset or way of thinking that views the world (everything in the world) as a system
It emphasizes the interactions that keep the system alive
Benefits
Identification of a system leads to abstraction
From abstraction you can think about the essential characteristics of a specific system
Abstraction allows the analyst to gain insights into a specific system, to question assumptions, to provide documentation, and to manipulate the system without disrupting the real situation
IRS
Is a system that is capable of storage, retrieval, and maintenance of information items
The task of an IR system is to match two abstractions: the user's information need and the data (documents) abstracted in the system
[Figure: Need [ ] Docs, matching the two sides]
Cont…
The goal of IR systems is to help users find information that satisfies their information needs
Information needs are translated into a query consisting of keywords (word forms) which summarize the description of the user's information need
Given the user query, the key goal of an IR system is to retrieve information which might be useful or relevant to the user
Cont…
Examples:
Search engines like Google, which retrieve documents on the Web containing the keywords and return a ranked list of references to relevant documents
Such search engines are word form based
IRS – Presentation of IR
Purpose of IRS
The purpose of an IRS is to capture wanted items (information) and to filter out unwanted information
1. Writers' ideas
Information is generated by authors
Authors generate large quantities of information every day
Ideas are represented in the form of documents
2. Information need
At the other end of the chain of communication we have readers, each with an individual need for information which has to be selected from the mass available
3. IRS
Helps to organize the information sources in (1)
Helps to bring (1) and (2) together (writer and information seeker)
Information Retrieval- objective/goal
Minimize the search overhead of a user who is locating needed information
Needed information: either
All information in the system relevant to the user's needs
Sufficient information in the system to complete a task
Example
Looking for an item to purchase
Looking for an item to purchase at minimal cost
Cont…
Support user search, providing tools to overcome obstacles such as:
Ambiguities inherent in languages
Homographs: words with identical spelling but multiple meanings
Limits to the user’s ability to express needs
Lack of system experience or aptitude
Lack of experience in the area being searched
Initially only a vague concept of the information sought
Differences between the user’s vocabulary and the authors’ vocabulary: different words with similar meanings
Basic functions of an IRS
Analysis of documents and organization of information (creation of a document database)
Analysis of users' queries and preparation of a strategy to search the database
Actual searching or matching of users' queries with the database
Retrieval of items that fully or partially match the search statement
Subsystems of an IR system
The two subsystems of an IR system:
Searching: an online process of finding relevant documents in the index list that match the user's query
Indexing: an offline process of organizing documents using keywords extracted from the collection
Indexing is used to speed up access to desired information from the document collection as per the user's query
Indexing and searching are inextricably connected
You cannot search what was not first indexed in some manner or other
Documents are indexed in order to be searchable
There are many ways to do indexing
To index, one needs to select an indexing language
There are many index structures, including inverted file, sequential file, suffix tree, signature file, etc.
Even taking every word in a document is an indexing language
Knowing searching is knowing indexing
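The link between the two subsystems can be illustrated with a minimal sketch of an inverted file, assuming whitespace tokenization and a toy three-document collection. The function names `build_inverted_index` and `search` and the sample documents are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Offline indexing: map each word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Online searching: return ids of documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical toy collection.
docs = {
    1: "information retrieval systems",
    2: "database systems store records",
    3: "retrieval of information from text",
}
index = build_inverted_index(docs)
print(sorted(search(index, "information retrieval")))  # [1, 3]
```

Only words that were placed in the index at indexing time can be found at search time, which is the sense in which searching depends on indexing.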
Indexing Subsystem
[Figure: indexing subsystem; incoming documents are assigned a document identifier]
High level structure of an IRS
•Indexing
•Searching
Structure of an IR System
[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19. Search line: interest profiles and queries are translated (formulating the query in terms of descriptors) and stored in Store 1 (profiles/search requests). Storage line: documents and data are translated by indexing (descriptive and subject) and stored in Store 2 (document representations). Both translations follow the "rules of the game": rules for subject indexing plus a thesaurus, which consists of a lead-in vocabulary and an indexing language. Comparison/matching of the two stores, followed by ranking, yields potentially relevant documents.]
Cont…
Each incoming item is analysed
Appropriate descriptions are chosen to reflect the information content of the item
Each item is classified in accordance with established procedures and incorporated into the collection of existing information items
Procedures are established for formulating requests designed to satisfy an information need and for comparing these requests (queries) with the descriptions of the stored items
These comparisons are the basis for deciding which items are appropriate for the respective queries
Cont…
Finally, a retrieval and dissemination mechanism is used to deliver the information items of potential interest to the users of the information system
These steps are all carried out in conventional libraries, where a card catalogue forms the principal auxiliary tool used in an information search
In this course we are interested in the processes and methodologies needed to carry out such tasks automatically
Cont…
Translation from user need to query
Usually done manually (by the user)
Another view of IR
[Figure: flow diagram with the labels Information need, text input, Collections, Pre-process, Rank]
Content analysis areas
[Figure: flow diagram with the labels Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank, annotated with the questions "How is the query constructed?" and "How is the text processed?"]
[Figure: flow diagram with the labels Information need, text input, Collections, Pre-process, Rank, extended with Reformulated Query and Re-Rank]
Example of IRS
General Categories
In-house
Domain specific
Set up by a particular information center
Example: library catalogue
On-line
Designed to provide a variety of users with access to remote databases
Cont…
Modern IR systems
Computer-based library systems to search for books, papers and course information
Search engines, which retrieve documents on the Web containing the keywords and return a ranked list of references to relevant documents
Such search engines are word form based and often analyze the link structure of the WWW
Google, AltaVista and Yahoo are popular examples of IR applications
Electronic encyclopaedias (online or CD-ROM)
Database Systems Vs
Information Retrieval Systems
DBMS vs IRS
Cont…
On the information/data
DBMS: structured data (often homogeneous records), unambiguous semantics
IR systems: unstructured data (free text), ambiguity
On the answers/results
DBMS:
Records (tuples)
Perfect precision and recall, each item is relevant (no ranking)
Well-defined results
IR systems:
Documents
Imperfect precision and recall, each item has a specific degree of relevance (ranking)
Fuzzy results
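The contrast in the answers can be sketched in a few lines, assuming toy data and a deliberately naive relevance score (the number of query words a document shares). The records, documents, and the function names `data_retrieval` and `information_retrieval` are hypothetical and only illustrate the exact-match versus ranked-match distinction.

```python
# Hypothetical toy data, for illustration only.
records = [
    {"id": 1, "name": "Abebe", "dept": "CS"},
    {"id": 2, "name": "Sara", "dept": "IS"},
]
documents = {
    "d1": "tennis teams at a college in the USA",
    "d2": "college football results",
    "d3": "history of tennis",
}

def data_retrieval(rows, field, value):
    """DBMS style: exact match; every returned record satisfies the condition."""
    return [row for row in rows if row[field] == value]

def information_retrieval(docs, query):
    """IR style: score each document by shared query words and rank partial matches."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        score = len(query_words & set(text.lower().split()))
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

print(data_retrieval(records, "dept", "CS"))
print(information_retrieval(documents, "college tennis teams"))  # ['d1', 'd2', 'd3']
```

The first call returns only records that satisfy the condition exactly, while the second returns every partially matching document in order of estimated relevance.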
Cont…
On their relationship
The systems complement each other
On their history
DB grew out of files and traditional business systems
IR grew out of library science and the need to categorize/group/access books/articles
Cont…
Data retrieval
Information retrieval
Cont…
Data retrieval
Which records contain a set of keywords?
Well-defined semantics
A single erroneous object implies failure!
Information retrieval
Information about a subject or topic
Semantics is frequently loose
Small errors are tolerated
Cont…
IR system:
interpret contents of information items
generate a ranking which reflects relevance
notion of relevance is most important
Information retrieval is much more difficult
than data retrieval
The Retrieval Process
The Retrieval Process
[Figure: the retrieval process in a Web search engine. Components: Web browser, User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flow labels: text, user feedback, query, retrieved docs, ranked docs]
The Retrieval Process
The user interface: think of it as the user interface available with current IR systems, including Web search engines
It is necessary to define the text database before any of the retrieval processes are initiated
This is usually done by the manager of the database and includes specifying the following:
the documents to be used
Cont…
Once the logical view of the documents is
defined, the database manager builds an index
of the text
An index is a critical data structure
It allows fast searching over large volumes of
data
Different index structures might be used , but the
most popular one is the inverted file
Once the document database is indexed, the retrieval process can be initiated
Cont…
The query is then processed to obtain the retrieved documents
Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance
The user then examines the set of ranked documents in the search for useful information
The user may need to reformulate the query
Cont…
At this point, the user might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle
In such a cycle, the system uses the documents selected by the user to change the query formulation
Hopefully, this modified query is a better representation of the real user need
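One very simple way to picture such a feedback cycle is query expansion: add to the query the words that occur most often in the documents the user selected. The sketch below is an assumption made for illustration, not the method of any particular system; the function name `reformulate_query` and the `extra_terms` parameter are hypothetical.

```python
from collections import Counter

def reformulate_query(query, selected_docs, extra_terms=2):
    """Add the most frequent new words from user-selected documents to the query."""
    query_words = query.lower().split()
    counts = Counter()
    for text in selected_docs:
        for word in text.lower().split():
            if word not in query_words:
                counts[word] += 1
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    new_words = [word for word, _ in ranked[:extra_terms]]
    return " ".join(query_words + new_words)

# Hypothetical documents the user marked as definitely of interest.
selected = ["college tennis teams", "college tennis ranking"]
print(reformulate_query("tennis", selected))  # tennis college ranking
```

Real systems usually weight the added terms rather than simply appending them, but the cycle has the same shape: selected documents go in, a modified query comes out.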
Cont…
[Figure: Web search engine, with the labels User, Interface, queries, Search engine, Index, spider, Web pages]
Factors Affecting Effective Retrieval
The User Task
Retrieval
Database
Browsing/ surfing
Cont…
The user task: the user task might be one of retrieval or browsing
Retrieval
of information or data
Information need (retrieval goal) is focused and crystallized; purposeful; the user is often sophisticated
Browsing/surfing
Information need (retrieval goal) is vague and imprecise
Glancing around; the user is often naive
Both are initiated by the user
Users
The user: anyone who needs to find some information
The user groups
group by their knowledge of the system
novice users vs. experienced users
end users vs. information specialists
group by their domain knowledge
Domain experts vs. general public
group by information needs
need to locate a particular item
need some information
need all information on a subject
User’s Information Needs
At all levels of our life we need information (e.g.
crossing the road, health, nutrition, travel,…)
Information need is the desire to know, the desire
to fill a gap of knowledge
Example problem: someone wants to cross a road in a high-traffic area. What information is needed?
The side of the road on which people drive (left or right)
The meanings of the traffic lights (green, yellow, and red)
Sign posts, etc.
Logical view of documents
Document Processing Steps
From “Modern IR” textbook
Cont..
Documents in a collection are
frequently represented through a set
of index terms or keywords
An index term is a keyword (or group of related words) which has some meaning of its own (and usually has the semantics of a noun)
In its more general form, an index term is simply any word which appears in the text of a document collection
Cont…
Keywords might be extracted directly from the text of the document, or
Keywords might be specified by a human expert (this is frequently done in the information science arena)
No matter whether these representative keywords are derived automatically or generated by a specialist, they provide a logical view of a document (a concise logical view)
Cont...
Standard steps
Recognizing document structures (titles,
sections, paragraphs, etc.)
Break into tokens
Usually space and punctuation delimited
Special issues with some languages
The elimination of stopwords (such as
articles and connectives)
Cont…
Conflation: the use of stemming/morphological analysis
Purpose
Overcome the variants of word forms by reducing all words with the same root to a single form (i.e., reducing distinct words to their common grammatical root)
Most IR systems perform stemming on both text and query
The identification of noun groups (which eliminates adjectives, adverbs, and verbs)
Further operations can also be performed
Store in an inverted index (to be discussed in later chapters)
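The standard steps of tokenization, stopword elimination and conflation can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stopword list is a tiny hypothetical sample, and `crude_stem` is a naive suffix stripper written for this example, not a real stemming algorithm such as Porter's.

```python
import re

STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix stripping, NOT a real stemmer

def tokenize(text):
    """Break text into lowercase tokens delimited by spaces and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    """Drop very common words such as articles and connectives."""
    return [token for token in tokens if token not in STOPWORDS]

def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Full pipeline: tokenize, remove stopwords, then conflate word variants."""
    return [crude_stem(token) for token in remove_stopwords(tokenize(text))]

print(index_terms("The users are searching the indexed documents."))
# ['user', 'search', 'index', 'document']
```

Applying the same `index_terms` function to queries keeps text and query terms comparable, which matches the remark that most IR systems stem both.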
Cont…
Such text operations reduce the
complexity of the document
representation and allow moving the
logical view from that of a full text to
that of indexed terms
Index - A list of important key words
from the documents
Cont...
The full text is the most complete logical view of a document, but its usage usually implies higher computational costs
A small set of categories/index terms (generated automatically or by a human specialist) provides the most concise logical view of a document, but its usage might lead to retrieval of poor quality
Several intermediate logical views (of a document) might be adopted by an information retrieval system, as shown in the figure
Cont...
The index terms obtained are a description of
a document content and of its structure
Models may allow reference to the text
document
The models might also allow references to
the structure normally present in written text
(in this case we say a structured model)
Retrieval based on index terms or keywords
might be of fairly low quality
Cont…
Two major reasons for this
The user query might be composed of too few terms which
usually implies the query context is poorly characterized
This problem is dealt with through transformations in the query
such as query expansion and user relevance feedback
The set of keywords generated for a given document might
fail to summarize its semantic content properly
This problem is dealt with through transformations in the text such
as
Identification of noun groups to be used as keywords
Stemming
The use of a thesaurus
Cont...
Given a set of index terms for a document, we
notice that not all the terms are equally useful for
describing the document contents
There are index terms that are simply vaguer
than the others
Deciding on the importance of a term for
summarizing the contents of a document is not a
trivial issue
Despite this difficulty, there are properties of an index term that are useful for assessing its importance
Cont…
Examples of such properties
A word which appears in each of the one hundred thousand documents is completely useless as an index term because it does not tell us anything about which documents the user might be interested in
A word which appears in just five documents is quite useful because it narrows down considerably the space of documents which might be of interest to the user
Thus, distinct index terms have varying relevance when used to describe document contents
This effect is captured through the assignment of numerical weights to each of the index terms of a document
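One common way to turn this intuition into a numerical weight is the inverse document frequency (IDF), log(N / n), where N is the collection size and n is the number of documents containing the term. The slides do not name a specific weighting scheme, so the choice of IDF and of the natural logarithm here is an assumption made for illustration.

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency: rarer terms get larger weights."""
    return math.log(total_docs / docs_with_term)

N = 100_000  # collection size used in the example above
print(idf(N, N))            # 0.0: a word in every document carries no weight
print(round(idf(N, 5), 2))  # 9.9: a word in only five documents is highly discriminating
```

The two cases match the slide's examples: the word found in all one hundred thousand documents gets weight zero, while the word found in five gets a large weight.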
Challenges in IR
Why is IR a Difficult Problem?
The size of the web is doubling every year:
50 million pages in November 1995, 320 million pages in December 1997, 800 million pages in February 1999, 1 billion pages in 2000, and growing every day
A huge amount of data (e.g., the WWW) dictates efficiency, effectiveness and user-friendliness
Thus: any IR system needs the capability of large-scale data processing; the use of indexes and various representations is required
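As a rough arithmetic check on the page counts quoted above, the doubling time implied by two measurements can be estimated under an assumption of steady exponential growth. The function name `doubling_time_months` is hypothetical, and the month gaps (25 and 14) are counted from the quoted dates.

```python
import math

def doubling_time_months(pages_start, pages_end, months_elapsed):
    """Months needed to double, assuming steady exponential growth between two counts."""
    return months_elapsed * math.log(2) / math.log(pages_end / pages_start)

# Page counts quoted above; month gaps are counted from the quoted dates.
nov95_to_dec97 = doubling_time_months(50e6, 320e6, 25)
dec97_to_feb99 = doubling_time_months(320e6, 800e6, 14)
print(round(nov95_to_dec97, 1), round(dec97_to_feb99, 1))
```

Both estimates come out under twelve months, so on these figures the Web was at least doubling every year over that period.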
Cont…
Diversified user base: expert to casual users
The users of information retrieval systems include
Research scientists (who seek articles related to particular experiments)
Engineers (who try to determine whether a patent covering some new idea has previously been obtained)
Attorneys (who search for legal precedents)
Buyers in general (who try to obtain new product information)
Cont…
Information retrieval users
Have a wide variety of different information needs
(Interest), Exhibit many different backgrounds
May be led by many different reasons to use the retrieval
facilities
As a result, they require a variety of services and end
products
In other words, a system may be clumsy for an expert
user but difficult to use for a casual user
a system may return information too general to be
useful for an expert in the subject but too narrow for a
general user
Other Central Concepts in IR
Documents
Queries
Collections
Evaluations
Relevance
Documents
Document Retrieval Model. Are IR
systems better called Document Retrieval
systems?
Document: a long string of characters contained
in a single file
• What do we mean by a document?
− Full document?
− Document surrogates?
− Pages?
• A document is a representation of some aggregation
of information, treated as a unit
Cont…
Document Representation
Since
Documents are full of text
Not every word of the text is meaningful for searching/retrieval
Sometimes documents themselves do not even have identifiable attributes such as author or title
Documents need to be processed and represented in a concise and identifiable format/structure
Cont…
Documents should be represented to help
users identify and receive information
from the system.
to identify subjects
to provide summaries/abstracts
to classify subject categories
Collection
Queries
A query is some expression of a user’s information need
Can take many forms
Natural language description of the need
List of words
Formal query in a query language, e.g. a Boolean query (A and B or C)
Queries may not be accurate expressions of the information need
Differences between conversation with a person and formal query expression
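A Boolean query such as "A and B or C" can be evaluated with set operations over the lists of documents in which each term occurs. The sketch below assumes that AND binds more tightly than OR, so the query is read as (A and B) or C; the `postings` data and the function name `boolean_a_and_b_or_c` are hypothetical.

```python
# Hypothetical postings: for each term, the ids of the documents that contain it.
postings = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {5},
}

def boolean_a_and_b_or_c(index, a, b, c):
    """Evaluate (a AND b) OR c; AND is assumed to bind tighter than OR."""
    empty = set()
    return (index.get(a, empty) & index.get(b, empty)) | index.get(c, empty)

print(sorted(boolean_a_and_b_or_c(postings, "A", "B", "C")))  # [2, 3, 5]
```

A document either satisfies the expression or it does not, so a pure Boolean query returns an unranked set.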
Cont…
Given the user query, the information system has to retrieve the documents which are related to that query
The potentially large size of the document collection (e.g. the Web is composed of millions of documents) implies that specialized indexing techniques must be used if efficient retrieval is to be achieved
Thus, to speed up the task of matching documents to queries, proper indexing and searching techniques are used
Relevance
Why is IR Important?
Reference Materials from the web
Summary
Reflection