
Information Storage and Retrieval
Text Collections and IR
• Large collections of documents from various sources:
– news articles, research papers, books, digital libraries, Web pages
• Storage of text:
– Textual documents
• Searchable as text
• Words are represented as ASCII/Unicode
– Image and speech documents:
• Scanned images and spoken forms of text documents, which
are not searchable as text: the text (characters, words, etc.)
is represented as patterns
• Retrieval from image and speech documents: two options
– Recognition-based retrieval: OCR or ASR is required to
convert images or speech to ASCII (may be error prone);
text IR systems are then applied to the recognized documents
– Recognition-free retrieval: retrieval from image or speech
without explicit recognition
• Search for relevant documents directly in image or speech collections
What is Information Retrieval?
• Information retrieval is the process of searching a
large unstructured corpus for relevant documents
that satisfy the user’s information need
• IR is simply about finding information
• It focuses on providing the user with easy access to
information of their interest
[Figure: concept map relating Information need, Request, Query, Matching process, Index, Information item, Collection, and Relevance; edge labels include "formulation", "uses", "is represented by a", "is based on", and "contains"]
IR Processes
• Information retrieval is the process of
matching the query against the indexed
information objects
• An index is an optimized data structure
that is built on top of the information
objects
– allowing faster access for the search process
– The indexer:
• tokenizes the text (tokenization)
• removes words with little semantic value (stop-words)
• unifies word families (stemming)
– The same is done for the query as well
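The indexer steps listed above can be sketched in Python as follows. This is only a minimal illustration, not a production indexer: the stop-word list, the toy `stem` function (a crude suffix stripper, not Porter's algorithm), and the names `analyze` and `build_index` are assumptions introduced here.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption); real systems use larger lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def stem(token):
    """Toy suffix-stripping stemmer; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, drop stop-words, and stem. Used for documents and queries."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


docs = {
    1: "Indexing the documents",
    2: "Searching indexed documents",
    3: "The user query",
}
index = build_index(docs)

# The query goes through the same analysis before it is looked up.
query_terms = analyze("indexes")
matches = set.union(*(index.get(t, set()) for t in query_terms))
```

Because the query "indexes" is reduced to the same stem as "Indexing" and "indexed", it matches documents 1 and 2 even though neither contains the literal query word.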
IR Processes
• The IR system responds by matching information
objects, which are relevant to a query
• Information retrieval focuses on finding relevant
information rather than simple pattern matching
– Relevance
• is a subjective notion
• depends on the task being solved and its context
• can change with time (e.g., new information becomes available)
• can change with location (e.g., the most important
answer is the closest one)
• can change with the device (e.g., the best answer is a
short document that is easier to download and visualize)
IR Processes
• A retrieval strategy (model) is an
algorithm and related structures that takes
a query and a set of documents and
assigns a similarity measure between the
query and each document
– similarity represents relevance to the user
query
– Documents are ranked on the basis of their
similarity to the query
• This process can be repeated and the
query can be modified
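A retrieval strategy as described above can be sketched in Python. The slides do not name a specific model, so this sketch assumes one common choice, TF-IDF weighting with cosine similarity (the vector space model); the function names and the tiny pre-tokenized collection are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return a TF-IDF vector per document and the IDF table."""
    n = len(tokenized_docs)
    # Document frequency: number of documents each term occurs in.
    df = Counter(t for tokens in tokenized_docs.values() for t in set(tokens))
    # Smoothed IDF so that terms occurring in every document keep some weight.
    idf = {t: math.log(n / count) + 1.0 for t, count in df.items()}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized_docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query_tokens, tokenized_docs):
    """Score every document against the query and sort by similarity."""
    vectors, idf = tf_idf_vectors(tokenized_docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    scores = {doc_id: cosine(q, vec) for doc_id, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokenized_docs = {
    "d1": ["information", "retrieval", "system"],
    "d2": ["database", "system"],
    "d3": ["image", "retrieval"],
}
ranking = rank(["information", "retrieval"], tokenized_docs)
```

Here `d1` ranks first because it shares both query terms, `d3` second because it shares one, and `d2` last with a similarity of zero. Modifying the query and calling `rank` again corresponds to repeating the process.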
IR Processes
Components of information retrieval:
• Indexing (for optimized access)
• Matching (searching, clustering)
• Ranking (with term boosting)
• Query modification (query expansion)
• Text analysis (tokenization, normalization, stop word removal, stemming)
The IR Process
[Figure: the IR process. An information need is represented as a query and each document is represented in the index; comparing the query with the index yields the retrieved documents, which are then evaluated.]
IR as a Discipline
• IR deals with the representation, storage,
organization of, and access to information
items such as documents, web pages,
online catalogs, unstructured and semi-structured
records, and multimedia objects
• It can involve a range of content and media
• The early goals of IR were indexing text and
searching for useful documents in a
collection
– Much IR research focuses more specifically on
text retrieval
IR as a Discipline
• The area has grown beyond its early goals
– Nowadays research in IR includes:
• modeling,
• web search,
• text classification,
• system architecture,
• user interface,
• data filtering,
• language,
• cross-language retrieval,
• audio (speech and music) retrieval,
• image retrieval,
• video retrieval,
• question answering, etc.
IR as a Discipline
• IR can be studied from two distinct but
complementary points of view
– A computer-centered view, which consists of:
• Building efficient indexes
• Processing user queries with high performance
• Developing ranking algorithms to improve results
• …
– A human-centered view, which consists of:
• Studying the behavior of the user
• Understanding the user’s needs
• Determining how an understanding of the user’s needs
affects the organization and operation of the retrieval
system
• …
IR as a Tool

• IR is a tool that finds and selects from a
collection of items a subset that serves the
user’s purpose
Examples of IR systems
• Text-based (Lexis-Nexis, Google, FAST): Search by
keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): Search by visual
appearance (shapes, colors, …).
• Question answering systems (AskJeeves,
Answerbus): Search in (restricted) natural language
• Digital and virtual libraries
• Other:
– Cross language vs. multilingual information retrieval,
– Music retrieval
– Medical search engines
IR Serves as a Bridge
• An information retrieval system serves as a
bridge between the world of authors and the
world of readers/users
– That is, writers present a set of ideas in a document
using a set of concepts. Users then query the IR
system for relevant documents that satisfy their
information need.

[Figure: the IR system shown as a black box between the User and the Documents]
IR System Architecture
Indexing, retrieval and ranking
IR System Architecture
• Document collection
• Document representation
– Text analysis/Operations
– Indexing – executed offline
• Query parsing and expansion
– spelling correction, normalization, stop word removal, etc.
• Retrieval and ranking – IR models
• Evaluation of the quality of the answer
• Relevance feedback – to improve ranking
– e.g., the user’s clicks on the retrieved documents
• Formatting – consists of retrieving the title of the
documents and generating snippets for them
Issues in IR
• Document/Text representation
– what makes a “good” representation?
– how is a representation generated from text?
– what are the retrievable objects and how are they
organized?
• Information need representation
– what is an appropriate query language?
– how can interactive query formulation and refinement be
supported?
• Comparing representations
– what is a “good” similarity measure & retrieval model?
– how is uncertainty represented?
• Evaluating effectiveness of retrieval
– what are good metrics?
– what constitutes a good experimental test bed?
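As one possible answer to the question of metrics, two widely used effectiveness measures are precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). The slides do not name specific metrics, so the following Python sketch is an assumption added for illustration; the document ids and relevance judgments are made up.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for a single query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical system output and hypothetical relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).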
Information Vs Data Retrieval
• Data retrieval: the task of determining
which documents of a collection contain the
keywords in the user query
• Data retrieval system
– Relational database
– Deals with data that has a well defined structure
and semantics
– A single erroneous object among a thousand
retrieved objects means total failure
• Data retrieval does not solve the problem of
retrieving information about a subject or
topic
Information Vs Data Retrieval
Features             Data Retrieval   Information Retrieval
Matching             Exact match      Partial or best match
Query language       Artificial       Natural
Query specification  Complete         Incomplete
Items wanted         Matching         Relevant
Error response       Sensitive        Insensitive
Items                Structured       Not well structured
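The contrast between exact match and partial (best) match can be sketched in Python. This is a simplified illustration under stated assumptions: `exact_match` stands in for a database-style lookup, `best_match` scores documents by the fraction of query terms they contain, and the sample records and documents are invented.

```python
def exact_match(records, field, value):
    """Data retrieval style: a record either satisfies the condition or not."""
    return [r for r in records if r.get(field) == value]


def best_match(query_terms, docs):
    """IR style: score each document by the share of query terms it contains."""
    query = set(query_terms)
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query & set(text.lower().split()))
        if overlap:
            scored.append((doc_id, overlap / len(query)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


records = [{"id": 1, "city": "Addis Ababa"}, {"id": 2, "city": "Moscow"}]
docs = {
    "a": "world series champions",
    "b": "series of lectures",
    "c": "cooking recipes",
}

found = exact_match(records, "city", "Moscow")    # one matching record
missed = exact_match(records, "city", "moscow")   # case differs: no match
ranked = best_match(["world", "series", "winner"], docs)
```

The exact-match lookup fails as soon as the value differs in case, while the best-match query still returns documents `a` and `b`, ranked by how many query terms they share, even though no document contains the word "winner".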
Information Retrieval Research areas
• Much of IR research focuses specifically on text retrieval,
but there are many other interesting areas:
– Audio retrieval, which deals with searching for speech or
music files
– Cross-language retrieval, which uses a query in one
language (say English) and finds documents in other
languages (say Amharic and Russian)
– Question-answering IR systems, which retrieve answers from
a body of text. For example, the question "Who won the 1997
World Series?" finds a 1997 headline "World Series: Marlins
are champions."
– Image retrieval, which finds images on a given topic or
images that contain a given shape or color
– Video retrieval, which searches for the video files that the user
is looking for
Is IR just document retrieval?
• Cross-language information retrieval
• Information extraction
• Question answering
• Document Summarization
• Text classification
• Multimedia information retrieval
• Multi-database searching
• Document provenance
• Agents (information filtering, tracking, routing)
• Recommender systems
• Text mining
• …
Assignment 1
Write an overview of one of the following topics and submit it in
softcopy (also share it with your classmates). Your overview should
provide an introduction, a typical architecture, techniques
and methods, the performance achieved so far, future
research directions, and the reference materials you have
used.
1. IR system for local languages
2. Information extraction
3. Text Summarization
4. Cross language IR
5. Multimedia IR
6. Question Answering
7. Information Filtering
8. Web IR
9. Intelligent IR
10. Recommender System
11. Document/information provenance
12. Text classification
13. Multi-database searching
