Lecture 1
8/23/2011
Introduction
What is Information Retrieval?
Information retrieval is the science of searching for information in
documents, searching for documents themselves, searching for
metadata which describe documents, or searching within databases,
whether relational stand-alone databases or hypertextually-networked
databases such as the World Wide Web.
Wikipedia
The study of methods and structures used to represent and access
information.
Witten et al.
The IR definition can be found in this book.
Salton
IR deals with the representation, storage, organization of, and access to
information items.
Salton
Information retrieval is the term conventionally, though somewhat
inaccurately, applied to the type of activity discussed in this volume.
van Rijsbergen
Manning et al.
How about: what you do when you use Google
Social Media
Course Plan
Course Plan
Administrivia
Work/Grading
Programming assignments 30%
Quizzes 30%
Project 30%
Participation 10%
Textbook
Introduction to Information Retrieval, by Manning, Raghavan and Schütze
Textbook
Administrivia
CAETE
Administrivia
Questions?
Simple Unstructured Data Scenario
Term-Document Matrix
            Antony and   Julius   The
            Cleopatra    Caesar   Tempest   Hamlet   Othello   Macbeth
Antony      1            1        0         0        0         1
Brutus      1            1        0         1        0         0
Caesar      1            1        0         1        1         1
Calpurnia   0            1        0         0        0         0
Cleopatra   1            0        0         0        0         0
mercy       1            0        1         1        1         1
worser      1            0        1         1        1         0
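A minimal sketch of how a Boolean query could be answered from this matrix, assuming each row is treated as a 0/1 vector over the six plays. The query used here (Brutus AND Caesar AND NOT Calpurnia) and the helper names `row_and`, `row_not`, and `matching_plays` are illustrative assumptions, not taken from the slides.

```python
# Plays in column order, as in the matrix above.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 incidence row per term, copied from the matrix.
INCIDENCE = {
    "Antony":    [1, 1, 0, 0, 0, 1],
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
    "Cleopatra": [1, 0, 0, 0, 0, 0],
    "mercy":     [1, 0, 1, 1, 1, 1],
    "worser":    [1, 0, 1, 1, 1, 0],
}


def row_and(a, b):
    """Element-wise AND of two incidence rows."""
    return [x & y for x, y in zip(a, b)]


def row_not(a):
    """Element-wise complement of an incidence row."""
    return [1 - x for x in a]


def matching_plays(row):
    """Names of the plays whose bit is set in the row."""
    return [play for play, bit in zip(PLAYS, row) if bit]


# Assumed example query: Brutus AND Caesar AND NOT Calpurnia.
result = row_and(
    row_and(INCIDENCE["Brutus"], INCIDENCE["Caesar"]),
    row_not(INCIDENCE["Calpurnia"]),
)
print(result)                  # [1, 0, 0, 1, 0, 0]
print(matching_plays(result))  # ['Antony and Cleopatra', 'Hamlet']
```

Plain lists are used instead of packed integers so that each bit stays lined up with its column in the table.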
Answers to Query
Bigger Collections
The Matrix
Inverted Index
Brutus 2 4 8 16 32 64 128
Calpurnia 1 2 3 5 8 13 21 34
Caesar 13 16
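A rough sketch of the structure above, assuming each term maps to a sorted list of docIDs (its postings list). The `intersect` function is an assumed illustration of how an AND query might be answered by walking two sorted lists in step. The slides as extracted do not spell out this procedure.

```python
# Postings lists copied from the slide: term -> sorted docIDs.
POSTINGS = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}


def intersect(p1, p2):
    """Merge two sorted docID lists, keeping docIDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


print(intersect(POSTINGS["Brutus"], POSTINGS["Calpurnia"]))  # [2, 8]
print(intersect(POSTINGS["Brutus"], POSTINGS["Caesar"]))     # [16]
```

Because both lists are sorted, the walk touches each posting at most once, so the cost is linear in the combined list length.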
Inverted Index
Tokenizer
Token stream: Friends Romans countrymen
Linguistic modules (more on these later)
Modified tokens: friend roman countryman
Indexer
Inverted index:
friend 2 4
roman 1 2
countryman 13 16
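The first two stages of this pipeline can be sketched as follows. This is a toy illustration only: `tokenize` and `normalize` are hypothetical helpers, and the suffix rules in `normalize` are hand-written assumptions chosen to reproduce the slide's example, not a real stemmer or lemmatizer.

```python
import re


def tokenize(text):
    """Split raw text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Toy stand-in for the linguistic modules: lowercase, then strip a plural.

    The two suffix rules are assumptions that happen to fit the slide's
    example; real systems use proper stemming or lemmatization.
    """
    t = token.lower()
    if t.endswith("men"):
        return t[:-3] + "man"   # countrymen -> countryman
    if t.endswith("s") and not t.endswith("ss"):
        return t[:-1]           # friends -> friend, romans -> roman
    return t


tokens = tokenize("Friends, Romans, countrymen.")
terms = [normalize(t) for t in tokens]
print(tokens)  # ['Friends', 'Romans', 'countrymen']
print(terms)   # ['friend', 'roman', 'countryman']
```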
Indexer steps: Token sequence
First generate a sequence of <token,
Document-ID> pairs from the
stream of documents being indexed.
[Figure: example documents Doc 1 and Doc 2]
[Figure labels: terms and counts; pointers; lists of docIDs]
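A minimal sketch of the indexer steps, assuming the tokens have already been normalized. The two documents in `DOCS` are made-up placeholders, not the slide's Doc 1 and Doc 2, and `token_sequence` and `build_index` are hypothetical names. The sketch generates (term, docID) pairs, sorts them, and groups them into terms with counts and lists of docIDs.

```python
from itertools import groupby

# Hypothetical, already-normalized documents (not the slide's examples).
DOCS = {
    1: ["friend", "roman", "countryman", "friend"],
    2: ["roman", "caesar"],
}


def token_sequence(docs):
    """Step 1: emit one (term, docID) pair per token occurrence."""
    return [(term, doc_id) for doc_id, terms in docs.items() for term in terms]


def build_index(pairs):
    """Steps 2 and 3: sort the pairs, drop duplicates, group by term.

    Each term maps to (document count, sorted list of docIDs).
    """
    unique_sorted = sorted(set(pairs))
    index = {}
    for term, group in groupby(unique_sorted, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)
    return index


pairs = token_sequence(DOCS)
index = build_index(pairs)
for term, (count, postings) in index.items():
    print(term, count, postings)
```

Here a Python dict stands in for the dictionary of terms, and each value holds the list directly rather than a pointer to a separately stored postings list.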
Indexing
Next time