INSY 2063: IR
Chapter 1:
Information Storage and Retrieval
Questions
• Clay tablets
cloud
Computers
What is Information, storage and retrieval?
• Information:
Data that has been processed and has meaning in itself; the
meaning may be useful, but does not have to be.
Information Retrieval (IR)
• The term Information Retrieval was first coined by Calvin
Mooers (1950)
• Example :
Employee   Manager   Salary
A          B         80000
Example
• Structured data allows for expressive queries like:
1 Abebe Beka 2
2 Bona Dedefo 2
3 Chala Boru 6
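As a minimal sketch of what an expressive query over structured data can look like, the Python snippet below filters and aggregates records laid out as Employee / Manager / Salary, as in the earlier example. Only the row (A, B, 80000) comes from the slides; the other rows and the names `Record`, `employees_of` and `average_salary` are hypothetical, added for illustration.

```python
from typing import NamedTuple


class Record(NamedTuple):
    employee: str
    manager: str
    salary: int


# Only the first row appears in the slides; the rest are made-up sample data.
RECORDS = [
    Record("A", "B", 80000),
    Record("C", "B", 60000),
    Record("D", "E", 95000),
]


def employees_of(records, manager, min_salary=0):
    """Return names of employees under `manager` earning at least `min_salary`."""
    return [
        r.employee
        for r in records
        if r.manager == manager and r.salary >= min_salary
    ]


def average_salary(records, manager):
    """Average salary of all employees reporting to `manager` (0.0 if none)."""
    salaries = [r.salary for r in records if r.manager == manager]
    return sum(salaries) / len(salaries) if salaries else 0.0
```

Because every field has a fixed meaning, a condition such as "manager is B and salary is at least 70000" can be stated exactly, which is much harder to do against free text.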
• Unstructured data: does not have a clear, overt semantic
structure (e.g., free text on a web page, video, audio)
Unstructured Information
data Retrieval
Generally;
• The user can also often control the output, for example
the number of retrieved documents to display and whether
search terms are highlighted.
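The two output controls mentioned above can be sketched in a few lines of Python. This is an illustration only, not how any particular IR system works: the function name `search`, its parameters, and the asterisk-based highlighting are all assumptions made for the example.

```python
import re


def search(docs, term, max_results=10, highlight=True):
    """Return up to `max_results` documents that contain `term`.

    Matching is a case-insensitive whole-word match. When `highlight`
    is true, each occurrence of the term is wrapped in asterisks.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    results = []
    for doc in docs:
        if len(results) >= max_results:
            break  # the user asked for no more than this many documents
        if pattern.search(doc):
            if highlight:
                doc = pattern.sub(lambda m: "*" + m.group(0) + "*", doc)
            results.append(doc)
    return results
```

Calling `search(docs, "retrieval", max_results=1)` would return at most one document, with each match of "retrieval" marked up.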
Goal of IR
Remark
Apache Solr
Lemur
Terrier
RapidMiner
Generally;
• Is a set of rules and procedures, as operated by humans and/or
machines, for doing some or all of the following operations
• Consists of:
Acts as an assistant
Major functions of an IR system
– Analyze contents of information items
– Vocabulary subsystem
• In indexing we need to use a controlled vocabulary, i.e., a list of selected
subject terms used to represent a document. The indexing is updated
based on this vocabulary
– DBMS:
– IR systems
• DBMS
– IR system
• Incomplete
Relationship
[Figure: components of an IR system. Modules: Text Operations, Query
Operations, Indexing, Searching, Ranking, DB Manager Module. Data stores:
Index (inverted file), Text Database. Labeled flows: logical view, query,
user feedback, retrieved docs, ranked docs.]
For queries
• For queries, the query has arisen as a result of an information
need on the part of the user
Chapter 2:
Automatic Term Selection and
Term Weighting
Definition of Term Selection
• If full-text representation is adopted, then all words are used for
indexing (this is not very efficient, as it incurs overhead in both
time and space)
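To make the overhead concrete, here is a minimal sketch of full-text indexing in Python, assuming a simple inverted index that maps each word to the documents containing it. The function name `build_full_text_index` and the tokenization rule (runs of letters and digits) are assumptions for this example.

```python
import re
from collections import defaultdict


def build_full_text_index(docs):
    """Map every distinct lowercased word to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}
```

Every distinct word gets an entry, including very common ones such as "the", so the index grows with the full vocabulary of the collection. Selecting a smaller set of index terms reduces that cost.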
Index term
• Is also called a keyword
• Assumption
• Index terms can be extracted from the title, abstract and text of
the document
Indexing
Is a critical process
– Non-weighted indexing
– Weighted indexing
Indexing
• Non-weighted indexing
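The difference between the two kinds of indexing can be sketched as follows. The slides do not say which weighting scheme is used, so this example assumes the simplest one, raw term frequency; the function names are hypothetical.

```python
from collections import Counter


def non_weighted_index(tokens):
    """Non-weighted (binary) indexing: a term is either assigned to the document or not."""
    return {term: 1 for term in set(tokens)}


def weighted_index(tokens):
    """Weighted indexing: each term carries a weight, here its raw frequency."""
    return dict(Counter(tokens))
```

For the tokens of "retrieval of information and retrieval of data", the non-weighted index records "retrieval" with the same value as "data", while the weighted index records that "retrieval" occurs twice.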
Indexing specificity
• Should we use general index terms or more specific terms?
• Ways to do indexing
– Manual
– Indexing vocabulary
– Collection characteristics
Advantages of Manual Indexing
• Labor intensive
– Information overload
• Enormous amount of information is being generated from
day to day activities
– Cost effectiveness
• Human indexing is expensive and labor intensive.
Current Procedures for Automatic Indexing
• Generating document representatives through automatic indexing
involves
– Use of stoplist
[Figure: automatic indexing pipeline. Documents -> Tokenizing (break text
into words) -> words -> Noise reduction (stoplist) -> non-stoplist words ->
Feature normalization (stemming*) -> stemmed words -> term weighting*.
*Indicates optional operation.]
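The indexing steps above (tokenizing, stoplist removal, optional stemming and optional term weighting) can be sketched in Python as below. This is a toy version under stated assumptions: the stoplist is a tiny made-up sample, `crude_stem` is a simple suffix stripper standing in for a real stemmer, and term weighting is assumed to be raw term frequency.

```python
import re
from collections import Counter

# A tiny illustrative stoplist; real stoplists are much longer.
STOPLIST = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(words):
    """Noise reduction: drop words that appear in the stoplist."""
    return [w for w in words if w not in STOPLIST]


def crude_stem(word):
    """Strip a few common suffixes (a stand-in, not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_document(text, stem=True, weight=True):
    """Run the pipeline; return term frequencies (dict) or the plain set of terms."""
    words = remove_stopwords(tokenize(text))
    if stem:
        words = [crude_stem(w) for w in words]
    return dict(Counter(words)) if weight else set(words)
```

The `stem` and `weight` flags mirror the operations marked as optional in the figure, so the same function can produce either a plain set of index terms or a weighted document representative.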
• Some are good, some are bad and some are indifferent
• It was Luhn (1957) who first suggested that certain words could
be automatically extracted from texts to represent their content
• Much of text analysis has been built on the original idea of Luhn
Automatic Text Analysis
• Luhn’s proposal
“The frequency of word occurrences in an article furnishes a useful
measure of word significance…”
Still today, the search engines that operate on the Internet index the
documents based on this principle
Automatic Text Analysis
• Luhn’s observation
– He noted that high-frequency words tend to be common,
non-content-bearing words
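Luhn's observation is easy to check on a small piece of text. The sketch below counts word occurrences and lists the most frequent ones; the function names are hypothetical and any sample sentence used with it is the reader's own.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word occurs in `text`."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def most_frequent(text, n=3):
    """Return the `n` most frequent words together with their counts."""
    return word_frequencies(text).most_common(n)
```

On ordinary English text the top of this list is typically occupied by words such as "the" and "of", which say little about what the document is about.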