
IR – Introduction. Created by Chethan.M

Information Retrieval
ISiM Syllabus : MISM 623 – Information Retrieval Systems
Course Objectives - This course examines information retrieval within the
context of full-text datasets. The students should be able to understand and
critique existing information retrieval systems and to design and build
information retrieval systems themselves. The course will introduce students to
traditional methods as well as recent advances in information retrieval (IR),
handling and querying of textual data. The focus will be on newer techniques of
processing and retrieving textual information, including hypertext documents
available on the World Wide Web.

Course Outline
Topics covered include:
• IR Models
o Boolean Model
o Vector Space Model
o Relational DBMS
o Probabilistic Models
o Language Models

• Web Information Retrieval


o citation network analysis
o social collaboration (PageRank and HITS algorithms)

• Term Indexing
o Zipf's Law
o term weighting

• Searching and Data Structures


o Inverted files to support Boolean and Vector Models
o Clustering
• non-hierarchical (single pass, reallocation)
• hierarchical agglomerative
o String Searching
o Tries, binary tries, binary digital tries, suffix trees, etc.

• Retrieval Effectiveness Evaluation


o Recall, Precision, Fallout
o Comparing systems using average precision

Course Readings: (Chethan is using)


1. Modern Information Retrieval / by Ricardo Baeza-Yates and Berthier Ribeiro-
Neto, 2001
2. Introduction to Information Retrieval / Christopher D. Manning, Prabhakar


Example of IR: Just getting a credit card out of your wallet so that you can type
in the card number is a form of information retrieval.

What is Information Retrieval?


Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within
large collections (usually stored on computers).

Motivation:
Information retrieval deals with the representation, storage, and organization
of information items, and with access to them. The representation and
organization of the information items should give users easy access to the
information in which they are interested. Unfortunately, characterizing the
user's information need is not a simple problem.
Given the user query, the key goal of an IR system (search engine) is to retrieve
information that might be useful or relevant to the user. The emphasis is on the
retrieval of information as opposed to the retrieval of data.

Information Retrieval is...

• The indexing and retrieval of textual documents.
• Concerned firstly with retrieving documents relevant to a query.
• Concerned secondly with retrieving efficiently from large sets of documents.
• Selectivity.
• Finding some desired information in a store of information.
• IR = the process of selecting from a source.
• IR and literature searching (finding documents).

Information Retrieval System:

“An information retrieval system is a device interposed between a potential
user of information and the information collection itself. For a given information
problem, the purpose of the system is to capture wanted items and to filter out
unwanted items”.

Information retrieval systems can be distinguished by the scale at which they
operate, and it is useful to distinguish three prominent scales.
• In web search, the system has to provide search over billions of
documents stored on millions of computers.
• At the other extreme is personal information retrieval. In the last
few years, consumer operating systems have integrated information
retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant
Search).
• In between is the space of enterprise, institutional, and domain-specific
search, where retrieval might be provided for collections such
as a corporation’s internal documents, a database of patents, or research
articles on biochemistry.


Data Retrieval:
• Which documents contain a set of keywords?
• Well-defined semantics.
• A single erroneous object implies failure.

Information Retrieval:
• Information about a subject or topic.
• Semantics are frequently loose.
• Small errors are tolerated.
• Natural-language retrieval over unstructured data.
• Ranking and relevance.

IR System:
• Interprets the contents of information items.
• Generates a ranking that reflects relevance.
• The notion of relevance is most important.

Data Retrieval VS Information Retrieval

                          Databases                           Information Retrieval

What we're retrieving     Structured data. Clear semantics    Mostly unstructured. Free text
                          based on a formal model.            with some metadata.
Queries we're posing      Formally defined queries.           Vague, imprecise information needs
                          Unambiguous.                        (often expressed in natural language).
Results we get            Exact. Always correct in a          Sometimes relevant, often not.
                          formal sense.
Interaction with system   One-shot queries.                   Interaction is important
                                                              (relevance feedback).

Text Database VS Database

     Text Database                          Database

1.   Emphasis on retrieval processing       Emphasis on transaction processing
2.   Data is rarely updated                 Data is updated
3.   No data-integrity constraints          Data integrity is enforced
4.   Unstructured data                      Structured data
     e.g. book, web page                    e.g. student record, registration data


History (Past)

• 1960-70’s:
– Initial exploration of text retrieval systems for “small” corpora of
scientific abstracts, and law and business documents.
– Development of the basic Boolean and vector-space models of
retrieval.
– Prof. Salton and his students at Cornell University were the leading
researchers in the area.

• 1980’s:
– Large document database systems, many run by companies:
• Lexis-Nexis
• Dialog
• MEDLINE

• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista

• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering

• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track


• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization

Present
Source of data
• Electronic libraries
• University documents
• Online data (web sites)

Example
• AltaVista
• Google
• etc.

Past, Present and Future


• The library was the first organization for IR.
• Indexes were assigned by academic and private organizations.
• Searching techniques (past: in the library)
– Title, subject
– Hierarchical search systems (e.g. Dewey Decimal), controlled
vocabularies, collections of abstracts
• Searching techniques (present: in the library)
– Department (of a faculty), term index
– Development of user-interface formats
– Electronic services
– Hypertext services

Related Research Areas of IR (Future)


• Electronic Commerce on Web (Digital Library Online)
• Database Management
• Library and Information Science
• Artificial Intelligence (AI)
• Natural Language Processing (NLP)
• Machine Learning (ML)

Typical IR Task
• Given:
– A corpus of textual natural-language documents.


– A user query in the form of a textual string.


• Find:
– A ranked set of documents that are relevant to the query.

Fig: The document corpus and the query string are the inputs to the IR system,
which outputs a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).

Relevance
• Relevance is a subjective judgment and may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her intended use of the
information (information need).
• Much of IR depends upon the idea that
– Similar vocabulary -> relevant to the same queries
• Usually look for documents matching query words
• “Similar” can be measured in many ways
– String matching/comparison


– Same vocabulary used


– Probability that documents arise from same model
– Same meaning of text
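
Two of the similarity notions listed above, string comparison and shared vocabulary, can be sketched in a few lines. This is only an illustration under simple assumptions: the function names are invented for these notes, words are split on whitespace, and the string comparison uses the ratio from Python's standard `difflib` module rather than any particular IR measure.

```python
from difflib import SequenceMatcher


def vocabulary(text):
    """Return the set of lower-cased, whitespace-separated words in text."""
    return set(text.lower().split())


def jaccard(a, b):
    """Shared-vocabulary similarity: |A and B| / |A or B|, in [0, 1]."""
    va, vb = vocabulary(a), vocabulary(b)
    if not va and not vb:
        return 0.0
    return len(va & vb) / len(va | vb)


def string_similarity(a, b):
    """Character-level similarity ratio in [0, 1], computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

Neither measure captures "same meaning of text": two documents about the same topic that use different words would still score low on both.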
Keyword Search
• Simplest notion of relevance is that the query string appears verbatim in
the document.
• Slightly less strict notion is that the words in the query appear frequently
in the document, in any order (bag of words).
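
The two notions of relevance above can be compared on a toy collection. The sketch below is an assumption-laden illustration, not a real search engine: the documents, the query and the function names are made up, and matching is case-insensitive on whitespace-separated words.

```python
from collections import Counter


def contains_verbatim(query, document):
    """Strict notion: the query string appears verbatim in the document.

    This is a plain substring test, so it can also match inside longer words.
    """
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(document.lower().split())
    return sum(counts[word] for word in query.lower().split())


# Toy collection (invented for illustration).
docs = {
    "d1": "information retrieval systems rank documents",
    "d2": "retrieval of information is about ranking information",
    "d3": "databases store structured data",
}
query = "information retrieval"

verbatim = [d for d, text in docs.items() if contains_verbatim(query, text)]
ranked = sorted(docs, key=lambda d: bag_of_words_score(query, docs[d]), reverse=True)
```

Here only `d1` contains the query verbatim, while the bag-of-words score places `d2` first because the query words occur in it more often, even though they are not adjacent.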

Problems with Keywords


• May not retrieve relevant documents that include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
– “bat” (baseball vs. mammal)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)

Intelligent IR
• Taking into account the meaning of the words used.
• Taking into account the order of words in the query.
• Adapting to the user based on direct or indirect feedback.
• Taking into account the authority of the source.

IR Basic Concepts

• The User Task
– Retrieval
• information or data
• purposeful
– Browsing
• glancing around
• e.g. drifting from F1 to cars, Le Mans, and tourism in France


Fig: Interaction of the user with the retrieval system through distinct tasks
(retrieval and browsing over the database).

• Document representation is viewed as a continuum: the logical view of
documents might shift
• From document set to term index
• Indexing
– Automatic
– By a specialist
• Full text: all occurrences of every word in the document
• Selected keywords
– Stop-word removal
– Stemming

Two Main IR Functions:

1. Indexing (System perspective)


- Text processing
- Index construction

2. Retrieval (User perspective)


- User interface
- Query processing
- Searching from index (index lookup)
- Search result ranking
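
The two functions above can be sketched together: index construction on the system side, and an index lookup on the retrieval side. This is a minimal sketch under simple assumptions (whitespace tokenization, no stop-word removal or stemming, Boolean AND semantics); the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Indexing: map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Retrieval: ids of the documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    # index.get avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Toy collection (invented for illustration).
docs = {1: "web search engine", 2: "search the web", 3: "database engine"}
index = build_inverted_index(docs)
```

A lookup such as `boolean_and(index, "web search")` touches only the posting sets for "web" and "search", which is why an inverted index avoids scanning every document at query time.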

IR System: (1) Indexing

IR System: (2) Retrieval


IR System Architecture

Fig: The process of retrieving information. Components shown: User Interface,
Text Operations, Query Operations, Indexing, Searching, Ranking, Database
Manager, Inverted Index File and Text Database. Labelled flows: user need,
text, logical view, user feedback, query, retrieved docs and ranked docs.

IR System Components

• Text Operations form index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of word-to-document
pointers.
• Searching retrieves documents that contain a given query token
from the inverted index.
• Ranking scores all retrieved documents according to a
relevance metric.
• User Interface manages interaction with the user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to improve retrieval:
– Query expansion using a thesaurus.
– Query transformation using relevance feedback.
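
Query expansion with a thesaurus, followed by a simple ranking step, can be sketched as below. The thesaurus is a hypothetical two-entry dictionary built from the synonym pairs mentioned in these notes ("restaurant"/"café", "PRC"/"China"), the relevance metric is just the count of matching terms, and the function names are invented for illustration.

```python
from collections import Counter

# Hypothetical thesaurus using the synonym pairs from these notes.
THESAURUS = {"restaurant": ["café"], "prc": ["china"]}


def expand_query(query, thesaurus=THESAURUS):
    """Query operation: add thesaurus synonyms to the original query terms."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in thesaurus.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def rank(docs, terms):
    """Ranking: order matching documents by total query-term occurrences."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy collection (invented for illustration).
docs = {
    "d1": "a quiet café near the station",
    "d2": "restaurant reviews and restaurant menus",
    "d3": "train timetable",
}
```

Ranking with the raw query "restaurant" retrieves only `d2`; ranking with `expand_query("restaurant")` also retrieves `d1`, which uses the synonym. Real systems would typically weight expanded terms lower than the original ones, which this sketch does not do.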


