Information Retrieval
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)
Information Retrieval
03/28/24 2
Document Corpus
LEXIS/NEXIS: (http://www.lexisnexis.com/)
Claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers,
11,400 databases; > 200,000 searches per day; 9 mainframes, 300
Unix servers, 200 NT servers.
Information Retrieval Systems?
Structure of an IR System
[Figure: the user and the documents on either side of a black box]
Given:
A corpus of document collections (text, image, video, audio)
published by various authors.
A user information need in the form of a query.
Typical IR System Architecture
[Figure: a query string and the document corpus are inputs to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
Web Search System
[Figure: a spider collects Web pages into the document corpus; a query string and the corpus are inputs to the IR system, which outputs ranked documents: 1. Page1, 2. Page2, 3. Page3, ...]
Overview of the Retrieval Process
Issues that arise in IR
Text representation:
What makes a “good” representation? The use of free-text or content-bearing
index-terms?
How is a representation generated from text?
What are retrievable objects and how are they organized?
Information needs representation:
What is an appropriate query language?
How can interactive query formulation and refinement be supported?
Comparing representations:
What is a “good” model of retrieval?
How is uncertainty represented?
Evaluating effectiveness of retrieval:
What are good metrics?
What constitutes a good experimental test bed?
Detail View of the Retrieval Process
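[Figure: the user need enters through the user interface as text. Text operations produce a logical view of both the query and the documents. The query is formulated from its logical view, with user feedback. Indexing builds the index file (inverted file) from the text database. Searching uses the index file to return retrieved docs, and ranking turns them into ranked docs shown to the user.]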
Focus in IR System Design
Subsystems of IR system
Searching:
An online process that scans the document corpus to find relevant
documents that match the user's query.
Statistical Properties of Text
How fast does vocabulary size grow with the size of a corpus?
Such factors affect the performance of an IR system and can be used
to select suitable term weights and other aspects of the system.
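The vocabulary-growth question can be explored empirically by counting distinct terms as more tokens are read. The sketch below is only an illustration and is not taken from the slides: the function name `vocabulary_growth`, the whitespace tokenizer, and the toy text are all assumptions. On large corpora the resulting curve is commonly described by Heaps' law, V = K * n^beta with 0 < beta < 1.

```python
def vocabulary_growth(tokens, step=1):
    """Return (tokens_seen, vocabulary_size) pairs sampled every `step` tokens."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n % step == 0:
            growth.append((n, len(seen)))
    return growth


# Toy text (hypothetical); a real study would read a large corpus.
text = "to be or not to be that is the question"
tokens = text.lower().split()
curve = vocabulary_growth(tokens, step=2)
print(curve)
```

Plotting the pairs for a real collection shows how quickly new terms keep appearing as the corpus grows.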
Text Operations
Indexing Subsystem
Documents
-> Assign document identifier (outputs: text, document IDs)
-> Tokenize (outputs: tokens)
-> Stop list (outputs: non-stop-list tokens)
-> Stemming & Normalize (outputs: stemmed terms)
-> Term weighting (outputs: weighted terms)
-> Index
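The indexing steps named on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the course's implementation: the stop list is a tiny made-up one, `stem` is a crude suffix stripper standing in for a real stemmer, raw term frequency is used as the term weight, and the two sample documents are hypothetical.

```python
from collections import Counter, defaultdict

# Assumed, deliberately tiny stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "me", "your"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return cleaned.split()


def stem(token):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(documents):
    """Map each term to {doc_id: weight}, using raw term frequency as the weight."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents, start=1):  # assign document identifier
        tokens = tokenize(text)                             # tokenize
        kept = [t for t in tokens if t not in STOP_WORDS]   # apply stop list
        terms = [stem(t) for t in kept]                     # stem and normalize
        for term, tf in Counter(terms).items():             # term weighting
            index[term][doc_id] = tf
    return dict(index)


docs = ["Friends, Romans, countrymen.", "Romans lend ears to friends and friends."]
index = build_index(docs)
print(index["friend"])
```

Each entry of `index` is one posting list: the documents that contain the term, together with the weight of the term in each of them.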
Example: Indexing
Indexer -> Index File (Inverted file):
friend: 2, 4
roman: 1, 2
countryman: 13, 16
Index File
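query
-> Parse query (outputs: query tokens)
-> Stop list (outputs: non-stop-list tokens)
-> Stemming & Normalize (outputs: stemmed terms)
-> Term weighting (outputs: query terms)
-> Similarity Measure, comparing query terms with index terms from the index file (outputs: relevant document set)
-> Ranking (outputs: ranked document set)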
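The query-side steps can be sketched the same way. The example below is a hedged illustration only: the stop list, the plural-stripping rule, the `DOC_VECTORS` weights, and the choice of cosine similarity as the similarity measure are all assumptions made for the sketch, since the slide does not fix any of them.

```python
import math

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}  # assumed tiny stop list

# Hypothetical index terms with weights, keyed by document ID.
DOC_VECTORS = {
    1: {"roman": 1.0, "friend": 1.0},
    2: {"friend": 2.0, "ear": 1.0},
    3: {"caesar": 1.0},
}


def query_terms(query):
    """Parse the query, drop stop words, and strip a trailing plural 's'."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    stemmed = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return {t: 1.0 for t in stemmed}  # every query term gets weight 1.0


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    q = query_terms(query)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in DOC_VECTORS.items()]
    relevant = [(d, s) for d, s in scored if s > 0]
    return sorted(relevant, key=lambda pair: pair[1], reverse=True)


ranking = search("the friends of Romans")
print([doc_id for doc_id, _ in ranking])
```

Documents with a zero score are dropped before ranking, which mirrors the split between the relevant document set and the ranked document set in the flow above.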
IR Models - Basic Concepts
The weight w_ij quantifies the importance of index term i for
describing the contents of document d_j.
vec(d_j) = (w_1j, w_2j, …, w_tj) is the weighted vector associated with
document d_j, where t is the number of index terms.
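As a concrete illustration of such a weighted vector, the sketch below builds one for a toy collection. The slide does not say how the weights are computed; tf-idf is used here purely as one common, assumed choice, and the documents and function names are hypothetical.

```python
import math

# Toy collection, already tokenized (hypothetical data).
docs = {
    "d1": ["friend", "roman", "countryman"],
    "d2": ["roman", "ear", "roman"],
    "d3": ["friend", "ear"],
}

# The t index terms, in a fixed order so every vector has the same layout.
terms = sorted({term for tokens in docs.values() for term in tokens})


def weight(term, tokens):
    """Weight as tf * idf: term frequency times log(N / document frequency)."""
    tf = tokens.count(term)
    df = sum(1 for other in docs.values() if term in other)
    return tf * math.log(len(docs) / df)


def doc_vector(doc_id):
    """Weighted vector of one document over the shared term list."""
    return [weight(term, docs[doc_id]) for term in terms]


print(terms)
print([round(w, 3) for w in doc_vector("d2")])
```

Every document gets a vector of the same length t, with a zero in each position whose term does not occur in that document.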
Mapping Documents & Queries
IR Models
The Boolean Model: Example
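The slide's own example did not survive in this text, so the following is a hypothetical stand-in rather than the lecturer's example. It assumes a small inverted index that maps each term to the set of documents containing it, and evaluates one Boolean query with set operations: AND as intersection, OR as union, NOT as complement against all documents.

```python
# Hypothetical inverted index: term -> set of document IDs containing it.
INDEX = {
    "brutus": {1, 2, 4},
    "caesar": {1, 2, 3, 5},
    "calpurnia": {2},
}
ALL_DOCS = {1, 2, 3, 4, 5}


def docs_with(term):
    """Documents containing the term (empty set if the term is not indexed)."""
    return INDEX.get(term, set())


# Query: brutus AND caesar AND NOT calpurnia
result = (docs_with("brutus") & docs_with("caesar")) & (ALL_DOCS - docs_with("calpurnia"))
print(sorted(result))
```

A document either satisfies the Boolean expression or it does not, so the result is an unranked set.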
System-centered studies:
Given documents, queries, and relevance judgments.
Try several variations of the system.
Measure which system returns the “best” hit list.
User-centered studies:
Given several users, and at least two retrieval systems.
Have each user try the same task on both systems.
Measure which system best satisfies the users'
information need.
Evaluation Criteria
Retrieval scenario
[Figure: candidate result sets labelled B. to F.; the legend distinguishes relevant documents from irrelevant documents]
Measuring Retrieval Effectiveness
                 Relevant   Irrelevant
Retrieved           A           B
Not retrieved       C           D
Recall:
The percentage of relevant documents that are retrieved from the database
in response to the user's query: A / (A + C).
Precision:
The percentage of retrieved documents that are relevant to the
query: A / (A + B).
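The two measures can be computed directly from the counts A, B, and C. The sketch below is a minimal illustration; the function name `precision_recall` and the document IDs in the sample run are assumptions made for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # A: relevant and retrieved
    b = len(retrieved - relevant)   # B: irrelevant but retrieved
    c = len(relevant - retrieved)   # C: relevant but not retrieved
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 5 relevant in the collection.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).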
Query Language
[Figure: the query string is run against the document corpus to produce ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...); feedback on that list drives query reformulation, and the reformulated query produces a new ranking (1. Doc2, 2. Doc1, 3. Doc4, ...)]
Challenges for IR researchers and
practitioners
Technical challenge: what tools should IR systems provide to
allow effective and efficient manipulation of information within
such diverse media of text, image, video and audio?
Thank You !!!