
Information Storage and Retrieval

Contents:
What is information?
What are the sources of information?
Why do we use information?

WHAT IS INFORMATION RETRIEVAL?
• An information retrieval (IR) system is a software program that
deals with the organization, storage, retrieval, and evaluation of
information from document repositories, particularly textual
information.
• IR is the activity of obtaining material of an unstructured nature
(usually text documents) that satisfies an information need from
within large collections stored on computers.
• For example, information retrieval takes place when a user enters a
query into the system.
• The IR system assists users in finding the information they require,
but it does not explicitly return the answers to the question.
CONT’D….
• It notifies the user of the existence and location of documents that
might contain the required information.
• It also supports users in browsing or filtering document collections
and in processing a set of retrieved documents.
• The system searches over billions of documents stored on millions of
computers.
• An email program, for example, provides a spam filter and manual or
automatic means of classifying mail so that it can be placed directly
into particular folders.
CONT’D…
• An IR system has the ability to represent, store, organize, and
access information items.
• A set of keywords is required to search.
• Keywords are what people search for in search engines.
• These keywords summarize the description of the information.
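
As an illustration of how a free-text query can be reduced to a set of
keywords, here is a minimal Python sketch. The stopword list and the
function name `extract_keywords` are assumptions made for this example
and do not come from the slides; real systems use much larger stopword
lists and more careful tokenization.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "for", "to", "what", "and"}

def extract_keywords(query):
    """Lowercase the query, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [token for token in tokens if token not in STOPWORDS]

keywords = extract_keywords("What are the sources of information?")
print(keywords)  # ['sources', 'information']
```

The remaining words are the ones that carry the topic of the query, which
is what the slide means by keywords summarizing the information.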

INFORMATION VS DATA RETRIEVAL

Information Retrieval:
• A software program that deals with the organization, storage,
retrieval, and evaluation of information from document repositories,
particularly textual information.
• Retrieves information about a subject.
• Small errors are likely to go unnoticed.

Data Retrieval:
• Deals with obtaining data from a database management system such as
an ODBMS. It is a process of identifying and retrieving data from the
database, based on the query provided by the user or application.
• Determines the keywords in the user query and retrieves the data.
• A single erroneous object means total failure.
CONT’D….

Information Retrieval:
• Not always well structured and is semantically ambiguous.
• Does not provide a solution to the user of the database system.
• The results obtained are approximate matches.
• Results are ordered by relevance.
• It is a probabilistic model.

Data Retrieval:
• Has a well-defined structure and semantics.
• Provides solutions to the user of the database system.
• The results obtained are exact matches.
• Results are unordered by relevance.
• It is a deterministic model.
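
The contrast between exact matching and ranked approximate matching can
be sketched in a few lines of Python. The records, the function names,
and the term-overlap score below are all assumptions chosen for
illustration; counting shared terms is only a crude stand-in for a real
retrieval model.

```python
# Hypothetical toy collection: record id -> text.
records = {
    1: "information retrieval systems",
    2: "database management systems",
    3: "retrieval of textual information from documents",
}

def exact_match(value):
    """Data-retrieval style: return ids whose text equals the query exactly."""
    return [rid for rid, text in records.items() if text == value]

def ranked_match(query):
    """IR style: score records by the number of shared query terms, best first."""
    query_terms = set(query.lower().split())
    scored = []
    for rid, text in records.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, rid))
    # Higher overlap first; ties are broken by record id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [rid for _, rid in scored]

print(exact_match("information retrieval"))          # []
print(exact_match("information retrieval systems"))  # [1]
print(ranked_match("information retrieval systems")) # [1, 3, 2]
```

The exact match either finds the record or returns nothing, while the
ranked match returns every partially matching record in order of
estimated relevance.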
TEXT COLLECTIONS AND IR
Information is organized into (a large number of) documents
• Large collections of documents from various sources: books, journal
articles, conference papers, newspapers, magazines, digital libraries,
Web pages, etc.

Sample Statistics of Text Collections
• Google (www.google.com), a search engine, offers access to over
50 billion Web documents.
• Bing (www.bing.com) covers over 250 million Web pages.
• It performs more than 8.5 billion search queries each day in more
than 40 languages.

STORAGE OF TEXT
Textual documents:
• Searchable as text
• Words are represented as ASCII/Unicode

Image documents:
• Scanned image of a text document, which is not searchable as text:
text (characters, words, etc.) is represented as patterns of pixels

Retrieval from document images: two options
• Recognition-based retrieval: OCR is required to convert document
images to ASCII (which may be error prone); text IR systems are then
applied to the recognized documents.
• Recognition-free retrieval: retrieval from document images without
explicit recognition; relevant documents are searched for directly in
the image collections.
WHAT IS INFORMATION RETRIEVAL?
Information retrieval is the process of searching a large unstructured
corpus for relevant documents that satisfy the user's information need.
• It is a tool that finds and selects from a collection of items a
subset that serves the user's purpose.
• Much IR research focuses more specifically on text retrieval. But
there are many other interesting areas: cross-language retrieval,
audio (speech & music) retrieval, question answering, image retrieval,
video retrieval.
EXAMPLES OF IR SYSTEMS
• Text-based (Lexis-Nexis, Google, FAST): Search by keywords. Limited
search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): Search by visual appearance
(shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus): Search in
(restricted) natural language.
• Digital and virtual libraries
• Other:
  • Cross-language vs. multilingual information retrieval
  • Music retrieval
  • Medical search engines
INFORMATION RETRIEVAL SERVES AS A BRIDGE
• An information retrieval system serves as a bridge between the world
of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of
concepts. Users then query the IR system for relevant documents that
satisfy their information need.

[Figure: the IR system as a black box between the user and the documents]
TYPICAL IR SYSTEM ARCHITECTURE

[Figure: a query string and a document corpus are fed to the IR system,
which returns a ranked list of relevant documents (1. Doc1, 2. Doc2,
3. Doc3, ...)]
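
To make the flow from query string and corpus to a ranked list concrete,
here is a small Python sketch. It assumes TF-IDF weighting with cosine
similarity, which is one common choice and not necessarily the model
this course goes on to use. The corpus contents and all function names
are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document corpus.
corpus = {
    "Doc1": "information retrieval finds relevant documents",
    "Doc2": "databases store structured data",
    "Doc3": "retrieval of documents from a large text corpus",
}

def tokenize(text):
    return text.lower().split()

doc_tokens = {name: tokenize(text) for name, text in corpus.items()}
n_docs = len(doc_tokens)
# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in doc_tokens.values() for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency, defined even for unseen terms.
    return math.log((1 + n_docs) / (1 + df[term])) + 1.0

def vectorize(tokens):
    # Map each term to (term frequency) * (inverse document frequency).
    return {term: count * idf(term) for term, count in Counter(tokens).items()}

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query):
    """Return (document, score) pairs with non-zero score, best first."""
    q_vec = vectorize(tokenize(query))
    scores = {name: cosine(q_vec, vectorize(tokens))
              for name, tokens in doc_tokens.items()}
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return [(name, score) for name, score in ranked if score > 0]

for name, score in rank("information retrieval"):
    print(name, round(score, 3))
```

For the query "information retrieval", Doc1 (which contains both terms)
is ranked above Doc3 (which contains only one), and Doc2 is not returned.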
IR SYSTEM VS. WEB SEARCH SYSTEM

[Figure: a Web spider collects pages into the document corpus; a query
string and the corpus are fed to the IR system, which returns a ranked
list of relevant documents (1. Page1, 2. Page2, 3. Page3, ...)]
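
The part that a Web search system adds to a plain IR system is the
spider that gathers the corpus. Below is a minimal sketch of a
breadth-first spider in Python. To stay self-contained it crawls a
hypothetical in-memory "web" (the `WEB` dictionary and its example.org
URLs are invented for this example) and makes no network requests.

```python
from collections import deque

# Hypothetical in-memory web: URL -> (page text, outgoing links).
WEB = {
    "http://example.org/": ("home page",
                            ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("page about retrieval", ["http://example.org/b"]),
    "http://example.org/b": ("page about indexing", ["http://example.org/"]),
    "http://example.org/orphan": ("never linked from anywhere", []),
}

def crawl(seed, max_pages=100):
    """Follow links breadth-first from the seed and return URL -> page text."""
    corpus = {}
    frontier = deque([seed])
    seen = {seed}
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        if url not in WEB:
            continue  # dead link: nothing to fetch
        text, links = WEB[url]
        corpus[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return corpus

pages = crawl("http://example.org/")
print(list(pages))
```

The orphan page is never reached because no crawled page links to it,
which illustrates why a spider only covers the part of the Web that is
reachable from its seeds.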
THE RETRIEVAL PROCESS
• It is necessary to define the text database before any of the
retrieval processes are initiated.
• The text operations transform the original documents and the
information needs and generate a logical view of them.
• Once the logical view of the documents is defined, the database
module builds an index of the text.
  • An index is a critical data structure.
  • It allows fast searching over large volumes of data.
• Different index structures might be used, but the most popular one is
the inverted file (more on this later), as indicated in the slide.
• Given that the document database is indexed, the retrieval process
can be initiated.
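
As a preview of the inverted file mentioned above, the following Python
sketch builds a very small inverted index: a mapping from each term to
the list of documents that contain it. The documents, the function
names, and the whitespace tokenization are assumptions for illustration
only.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
docs = {
    1: "information retrieval systems index text",
    2: "an index allows fast searching",
    3: "fast retrieval over large volumes of text",
}

def build_inverted_index(documents):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    term_postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not term_postings:
        return []
    return sorted(set.intersection(*term_postings))

index = build_inverted_index(docs)
print(index["retrieval"])                 # [1, 3]
print(search_all(index, "fast retrieval"))  # [3]
```

Searching touches only the posting lists of the query terms rather than
every document, which is why the index allows fast searching over large
volumes of data.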


THE RETRIEVAL PROCESS CONT’D…
The user first specifies a user need, which is then parsed and
transformed by the same text operations applied to the text.
Next, query operations are applied before the actual query, which
provides a system representation of the user need, is generated.
The query is then processed to retrieve documents.
Before the retrieved documents are sent to the user, they are ranked
according to their likelihood of relevance.
The user then examines the set of ranked documents in search of useful
information. The user has two choices:
(i) reformulate the query and run it on the entire collection, or
(ii) reformulate the query and run it on the result set.
At this point, the user might pinpoint a subset of the documents seen
as definitely of interest and initiate a user feedback cycle.
In such a cycle, the system uses the documents selected by the user to
change the query formulation.
Hopefully, this modified query is a better representation of the real
user need.
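
One simple way to picture the feedback cycle is query expansion: terms
that occur often in the documents the user selected are added to the
query. The Python sketch below assumes this plain frequency-based
expansion, which is a simplification and not a specific published
feedback algorithm. The stopword list, the sample documents, and the
function name `expand_query` are hypothetical.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "of", "a", "in", "and", "for", "to", "is"}

def expand_query(query_terms, selected_docs, n_new_terms=2):
    """Append the most frequent new terms found in the user-selected documents."""
    counts = Counter()
    for text in selected_docs:
        for term in text.lower().split():
            if term not in STOPWORDS and term not in query_terms:
                counts[term] += 1
    # Highest count first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:n_new_terms]]

# Documents the user marked as definitely of interest.
selected = [
    "jaguar speed in the wild",
    "the jaguar is a wild cat",
    "wild cat habitat",
]
print(expand_query(["jaguar"], selected))  # ['jaguar', 'wild', 'cat']
```

The expanded query now leans towards the animal sense of "jaguar", so it
is more likely to represent what the user actually wanted.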
ISSUES THAT ARISE IN IR
• Text document representation
  • What makes a “good” representation?
  • How is a representation generated from text?
  • What are the retrievable objects & how are they organized?
• Information need representation
  • What is an appropriate query language?
  • How can interactive query formulation & refinement be supported?
• Comparing representations
  • What is a “good” similarity measure & retrieval model?
  • How is uncertainty represented?
• Evaluating effectiveness of retrieval
  • What are good metrics?
  • What constitutes a good experimental test bed?
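
On the question of metrics, two widely used measures of retrieval
effectiveness are precision (the fraction of retrieved documents that
are relevant) and recall (the fraction of relevant documents that were
retrieved). The slides do not name specific metrics, so the choice here
is an assumption. The document ids in this Python sketch are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids returned by the system
    relevant:  ids judged relevant in the test collection
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# The system returned four documents; three documents are truly relevant.
p, r = precision_recall(["Doc1", "Doc2", "Doc3", "Doc4"],
                        ["Doc1", "Doc3", "Doc7"])
print(p)            # 0.5
print(round(r, 3))  # 0.667
```

Computing these values requires relevance judgments for each query,
which is one reason a good experimental test bed matters.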
INFORMATION RETRIEVAL FOCUS AREAS
Much of IR research focuses more specifically on text retrieval. But
there are many other interesting areas:
Audio retrieval, which deals with searching for speech or music files.
Cross-language retrieval, which uses a query in one language (say
English) and finds documents in other languages (say Amharic and
Russian).
Question-answering IR systems, which retrieve answers from a body of
text. For example, the question "Who won the 1997 World Series?" finds
a 1997 headline "World Series: Marlins are champions."
Image retrieval, which finds images on a given topic or images that
contain a given shape or color.
Video retrieval, which searches for the video files that the user is
looking for.
IS IR JUST DOCUMENT RETRIEVAL?
• Cross-language information retrieval
• Information extraction
• Question answering
• Document summarization
• Text classification
• Multimedia information retrieval
• Multi-database searching
• Document provenance
• Agents (information filtering, tracking, routing)
• Recommender systems
• Text mining
