
Information Retrieval (IR)

• Information retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information
need from within large collections (usually
stored on computers).
• Main Categories
1. Web Search
2. Personalized IR
3. Enterprise/Institutional/Domain-Specific Search
INFORMATION RETRIEVAL
• Structured Vs Unstructured Data
• Database Vs IR
• “I’m sorry, I can only look up your order if you can give me
your Order ID”
• APPLICATIONS OF IR
Search Engines
NEWS RETRIEVAL

VIDEO RETRIEVAL
Language Translation Technology
Recommender Systems
Online Social Networks
Product Review Sites
Email Spam Detection
Search Results Clustering
Search Query Suggestions
Many Other Tasks …
• Novelty Detection
• Book Detection
• Blog Recommendation
• Plagiarism Detection
• Opinion Detection
• Author Profiling
• etc.
An Example of IR Problem
• How to Search in a Document → Grepping
• Challenges for Grepping
– Too large data collections
– A proximity query such as “Pakistan NEAR Zardari” is not
practical with grepping
– Not suitable for Ranked Retrieval

Term-Document Incidence Matrix

Query
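A term-document incidence matrix and a Boolean query over it can be sketched as below. This is only an illustration: the four toy documents, the helper name `boolean_and_not`, and the query are assumptions made for the example and do not come from the slides.

```python
# Toy collection (hypothetical texts, invented for this sketch).
docs = {
    "doc1": "brutus killed caesar",
    "doc2": "caesar loved calpurnia",
    "doc3": "brutus and caesar and calpurnia",
    "doc4": "antony praised caesar",
}

doc_ids = sorted(docs)
vocabulary = sorted({term for text in docs.values() for term in text.split()})

# One row per term, one 0/1 entry per document:
# 1 means the term occurs in that document, 0 means it does not.
incidence = {
    term: [1 if term in docs[d].split() else 0 for d in doc_ids]
    for term in vocabulary
}

def boolean_and_not(term_a, term_b, negated_term):
    """Evaluate the Boolean query: term_a AND term_b AND NOT negated_term."""
    rows = zip(incidence[term_a], incidence[term_b], incidence[negated_term])
    return [d for d, (a, b, n) in zip(doc_ids, rows) if a and b and not n]

# Query: brutus AND caesar AND NOT calpurnia
result = boolean_and_not("brutus", "caesar", "calpurnia")
print(result)
```

Each query term selects one row of the matrix, and the Boolean operators are applied column by column, so every document gets a yes/no answer with no ranking.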
Boolean IR
• Query terms combined with AND, OR & NOT
• Ad hoc IR
– In ad hoc IR, the system aims to provide documents from within the
collection that are relevant to an arbitrary user's information need,
communicated to the system by means of a one-off, user-initiated
query.
• Information Need / Query
• Relevance
• Effectiveness (the quality of search results)
– Precision … Recall
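Precision and recall can be computed from two sets: the documents the system returned and the documents that are actually relevant. The sketch below assumes both are available as collections of DocIDs; the function name `precision_recall` and the sample IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set and relevance judgments.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).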
Some Concepts
• Term–Document Incidence Matrix: extremely
sparse, i.e. very few non-zero entries. A much better
representation is to record only the things that do occur.
• Inverted Index

• Dictionary is also known as “vocabulary” or “lexicon”.


• To gain the speed benefits of indexing at retrieval time, the index is built in
advance
Building an Inverted Index
• Collect the documents to be indexed
• Tokenize the text, turning each document
into a list of tokens
• Do linguistic processing
• Index the documents that each term occurs in
by creating an inverted index, consisting of a
dictionary and postings
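The four steps above can be sketched in a few lines. This is a minimal illustration, not a production indexer: the three sample documents are invented, and lowercasing stands in for "linguistic processing" (a real system would typically also apply stemming or similar normalization).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Steps 2 and 3: split text into tokens and normalize them.
    Lowercasing is the only normalization in this sketch (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Step 4: map each term (dictionary entry) to a sorted list of DocIDs (postings)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sort the dictionary terms and each posting list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Step 1: collect the documents (hypothetical sample texts keyed by DocID).
docs = {
    1: "Brutus killed Caesar.",
    2: "Caesar loved Calpurnia.",
    3: "Brutus and Calpurnia met Caesar.",
}

index = build_inverted_index(docs)

# Document frequency of a term = length of its posting list.
document_frequency = {term: len(plist) for term, plist in index.items()}

print(index["brutus"], document_frequency["caesar"])
```

Keeping each posting list sorted by DocID is what makes the linear-time merge of two lists possible at query time.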
Inverted Index
• Document Identifier (DocId)
• Sorting
• Document Frequency
Assignment No: 01
• Exercises 1.1, 1.2 and 1.3
• To be checked personally in my office, handwritten,
between 26 Sept and 03 Oct 2013
• Questions could be asked about your
assignment, so do it yourself
• Contact at: Saad.Missen@Gmail.Com in case
of any queries
2nd Lecture
Processing Boolean Queries
• Simple Conjunctive Queries (BRUTUS AND
CALPURNIA)

Posting Lists Intersection/ Postings Merging

O(x + y) time for the merge operation, where x and y are
the lengths of the two posting lists
Algorithm for merging two posting lists
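A minimal sketch of the two-pointer merge, assuming both posting lists are sorted by DocID. The DocIDs in the example are illustrative values, not data taken from the slides.

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by DocID.

    Walks both lists with one pointer each, so it needs at most
    len(p1) + len(p2) steps, i.e. O(x + y).
    """
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # DocID is in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance the pointer that points at the smaller DocID.
            i += 1
        else:
            j += 1
    return answer

# Illustrative posting lists for BRUTUS AND CALPURNIA.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
result = intersect(brutus, calpurnia)
print(result)
```

The merge only works because both lists are sorted; on unsorted lists the pointers could skip past matching DocIDs.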
Query Optimization
• “least amount of work for answering a query”
• Order in which posting lists are accessed is important
• What is the best order for query processing?
• Standard Order: Process terms in order of increasing
document frequency (i.e. start with the terms having the lowest DF).
If we start by intersecting the two smallest posting lists, then
all intermediate results must be no bigger than the smallest
posting list, and so less work remains to be done
• Therefore, the query “Brutus AND Caesar AND Calpurnia”
should be processed as “(Calpurnia AND Brutus) AND Caesar”
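The ordering heuristic can be sketched as follows: sort the query terms by document frequency (posting list length) and intersect from the rarest term upward. The helper name `intersect_many` and the posting lists are assumptions made for this example.

```python
def intersect(p1, p2):
    """Two-pointer intersection of two posting lists sorted by DocID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(index, terms):
    """Answer a conjunctive query, processing terms by increasing document frequency."""
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index[term])
    return ordered, result

# Hypothetical index: term -> sorted posting list.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132, 231],
    "calpurnia": [2, 31, 54, 101],
}

order, result = intersect_many(index, ["brutus", "caesar", "calpurnia"])
print(order, result)
```

With these lists the terms are processed as calpurnia, brutus, caesar, which corresponds to "(Calpurnia AND Brutus) AND Caesar", and the intermediate result never exceeds the length of the shortest posting list.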
Extended Boolean IR Model
• Ranked Retrieval Model
• Free-Text Queries
• Use of NEAR operator in Extended Boolean
• More intelligent models need to have
information about Term Frequency in their
framework (to be discussed later)
Assignment No: 02
• Find a stemmer and demonstrate it in
class on your own laptop using the text file
uploaded to the class Facebook Group
• Due Date: 10 Oct 2013
End of First Chapter
