Information Retrieval on the Web

Basics, Concepts and Models

Dr. Sowmya Kamath S, Dept of IT, NITK Surathkal 20-Sep-16
What is Information Retrieval (IR)?

 Generic Definition(s): The techniques of storing, recovering, and often disseminating recorded data, especially through the use of a computerized system.

 Wikipedia Definition: the science of searching for documents, for
information within documents and for metadata about documents, as
well as that of searching relational databases and the WWW.

 IR Textbook definition: finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers)

What is Information Retrieval (IR)?
 IR Textbook definition: finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers)
 the retrieved documents aim to satisfy a user's information need, usually expressed in natural language.

 Focus on dealing with -
 Documents, unstructured, text, large?
 Information need?
 Store, search, find?
 On the World Wide Web?
 In relational databases?
 In other repositories?

Data and Information

 Data:
 Unorganized and unprocessed facts;
 static;
 a set of discrete facts about events

 Information:
 Aggregation of data that makes decision-making easier

 Knowledge:
 derived from information in the same way information is derived from
data.

Information Retrieval vs. Databases
 Information Retrieval
  Retrieve all objects relevant to some information need
  E.g. Query: “semantic web”
  Result: list of (possibly relevant) documents

 Databases
  Retrieve all objects satisfying some well-defined conditions
  E.g. SELECT id FROM document WHERE title LIKE ‘%semantic web%’
  Result: well-defined result set

Why IR?

 The WWW connects users, devices and data (static, social, IoT content):
  More than 2.4 billion users
  Nearly a trillion pages
  More than 5 million terabytes of data

[Figure: the WWW at the intersection of users, devices and data]
Why IR?

 “80% of business is conducted on unstructured info”
 “85% of all data stored is held in unstructured format”
 “7 million Web pages are being added every day”

 “Unstructured data doubles every three months”

Recent IR History
 2000’s onwards….
 Link analysis for Web Search
 Google, Bing, Yahoo
 Automated Information Extraction
 Whizbang, Fetch, Burning Glass
 Question Answering
 TREC Q/A track, IBM Watson, Wolfram Alpha, Evie.
 Multimedia IR
 Image, Video, Audio and music (Baidu, TinEye, YouTube….)
 Cross-Language IR
 DARPA Tides
 Document Summarization Engines
 Google KnowledgeGraph, SenseBot.
 Learning to Rank
 AI, NLP, Machine Learning

IR System Architecture

[Figure: architecture of a typical IR system]
IR Models

 Any IR system is based on an IR model.

 The model defines …
 a query language,
 an internal representation of queries,
 an internal representation of documents,
 a ranking function which associates a real number with each query–
document pair.
 Optional: A mechanism for relevance feedback
 Notion of relevance can be binary or continuous (i.e. ranked retrieval)

IR Models
Boolean Retrieval

 The simplest (and arguably oldest) IR model

 Documents = sets of words (index terms)

 Query language= Boolean expressions over index terms

 Binary ranking function, i.e. 0/1-valued

IR Models
Boolean Retrieval

 Retrieval is based on membership in one or more sets using Boolean
Connectives
 “Find all documents indexed by the word ‘ta’!”
 “Find all documents indexed by the word ‘ta’ AND ‘tc’!”

IR Models
Boolean Retrieval (contd.)

 In Boolean Models, documents are usually indexed by an Inverted Index.
 For each term t, it stores a list of all documents that contain t.
 enables fast query processing.

 Using the inverted index, queries of the type “Show me all documents
containing term X (and/or/… Y) ” can be answered quickly.

IR Models
Boolean Retrieval (contd.)

 For example, generate an inverted index for a document collection
with some sample documents as below.
 Doc 1= “that’s one small step for a man, a giant leap for mankind”
 Doc n = “Gandhi’s small step was a turning point for India”

IR Models
Boolean Retrieval (contd.)

 Steps in Inverted Index construction –
 Tokenizing and preprocessing documents
 Normalize list of document-wise tokens

 Generate postings

 Sort postings

 Create postings lists, determine document frequency

 Split the result into dictionary and postings file → INVERTED INDEX
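The construction steps above can be sketched in Python (a minimal sketch: tokenization and normalization are simplified to lowercasing plus a regex, and `build_inverted_index` is an illustrative helper name):

```python
from collections import defaultdict
import re

def build_inverted_index(docs):
    """Build an inverted index {term: sorted postings list of doc ids}."""
    postings = []
    for doc_id, text in docs.items():
        # Tokenize and normalize: lowercase, keep only word characters
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        for term in set(tokens):            # one posting per (term, doc)
            postings.append((term, doc_id))
    postings.sort()                          # sort by term, then doc id
    index = defaultdict(list)
    for term, doc_id in postings:
        index[term].append(doc_id)           # postings list per dictionary term
    return dict(index)                       # document frequency = len(index[t])

docs = {
    1: "that's one small step for a man, a giant leap for mankind",
    2: "Gandhi's small step was a turning point for India",
}
index = build_inverted_index(docs)
print(index["step"])     # [1, 2]
print(index["mankind"])  # [1]
```

The dictionary here is the set of keys; each value is that term's postings list, already sorted by document id so that Boolean merges can be done in linear time.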

IR Models
Boolean Retrieval (contd.)
 For example, consider an inverted index for some document
collection as below.
 Doc 1= “that’s one small step for a man, a giant leap for mankind”
 Doc n = “Gandhi’s small step was a turning point for India”

[Figure: dictionary and postings lists for the sample collection]

IR Models
Boolean Retrieval (contd.)

 For the sample document collection considered –
 Doc 1= “that’s one small step for a man, a giant leap for mankind”
 Doc n = “Gandhi’s small step was a turning point for India”

 Query1 = “step AND mankind”
 Result set: {Doc 1}

 Query2 = “step OR mankind”
 Result set: {Doc 1, Doc n}
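These Boolean queries reduce to set operations over postings lists. A minimal sketch (the index contents are those of the sample collection; `AND`/`OR` are illustrative helper names):

```python
# Inverted index for the sample collection: term -> set of doc ids
index = {
    "step":    {"Doc 1", "Doc n"},
    "mankind": {"Doc 1"},
    "india":   {"Doc n"},
}

def AND(t1, t2):
    """Documents containing both terms: intersection of postings."""
    return index.get(t1, set()) & index.get(t2, set())

def OR(t1, t2):
    """Documents containing either term: union of postings."""
    return index.get(t1, set()) | index.get(t2, set())

print(AND("step", "mankind"))  # {'Doc 1'}
print(OR("step", "mankind"))   # {'Doc 1', 'Doc n'}
```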

IR Models
Boolean Retrieval (contd.)

 To match natural language better, “BUT NOT” can be used instead of
“AND NOT”.
 Query4 = “step BUT NOT India”

 Use “OF” to search for subsets of a given size:
 Query5 = “2 of {step, mankind, India}”
is equivalent to
 Query5 = “(step AND mankind) OR (step AND India) OR (mankind
AND India)”

IR Models
Boolean Retrieval (contd.)

 Boolean queries can be evaluated by computing unions and intersections of the terms’ postings sets –

 Example:
 result of “mankind AND step” = (result of “mankind”) ∩ (result of
“step”)
 result of “mankind OR step” = (result of “mankind”) ∪ (result of
“step”)

 Idea: Convert all queries to disjunctive (or conjunctive) normal form

IR Models
Boolean Retrieval (contd.)
• Each Boolean query can be rewritten in a Disjunctive Normal Form (DNF)
• q = ta ∧ (tb ∨ ¬tc)
  qdnf = (ta ∧ tb ∧ tc) ∨ (ta ∧ tb ∧ ¬tc) ∨ (ta ∧ ¬tb ∧ ¬tc)
  qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)

 Each disjunct represents an ideal set of documents
 A document satisfies the query if its binary term vector matches one of the disjuncts
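The DNF rewriting can be checked mechanically: enumerate all 0/1 assignments of (ta, tb, tc) and keep those that satisfy q = ta AND (tb OR NOT tc). A small sketch:

```python
from itertools import product

def q(ta, tb, tc):
    """The sample query q = ta AND (tb OR NOT tc), on 0/1 values."""
    return ta and (tb or not tc)

# The satisfying assignments (minterms) are exactly the disjuncts of the DNF
minterms = [t for t in product((1, 0), repeat=3) if q(*t)]
print(minterms)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
```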

IR Models
Boolean Retrieval (contd.)

 The Boolean model is really more a data retrieval model than an information retrieval model.

 Pros:
 Boolean expressions have precise semantics
 Structured queries
 Intuitive for expert users
 Simple and neat formalism → received great attention over the years and was adopted by many of the early commercial bibliographic systems

IR Models
Boolean Retrieval (contd.)

 The Boolean model is really more a data retrieval model than an information retrieval model.

 Cons:
 frequently it is not simple to translate an information need into a Boolean
expression.
 Most users find it difficult and awkward to express their query requests in
terms of Boolean expressions.
 No ranking.

IR Models
Bag of Words Representation

 A document is typically represented by a bag of words (unordered
words with frequencies).
 Bag = set that allows multiple occurrences of the same element.

 User specifies a set of desired terms with optional weights:
 Weighted query terms:
 Q = < database 0.5; text 0.8; information 0.2 >
 Unweighted query terms:
 Q = < database; text; information >
 No Boolean conditions specified in the query.

IR Models
Bag of Words Representation (contd.)

 Standard case:
 Vocabulary (bag) = set of all the words occurring in the collection’s
documents
 Each document is represented by the words it contains.

IR Models
Bag of Words Representation (contd.)

 Any document can be represented by an incidence vector.
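Building an incidence vector is a one-liner once the vocabulary is fixed (a minimal sketch; `incidence_vector` is an illustrative helper name):

```python
def incidence_vector(doc_tokens, vocabulary):
    """0/1 vector over the (ordered) vocabulary: 1 if the term occurs in the doc."""
    terms = set(doc_tokens)
    return [1 if term in terms else 0 for term in vocabulary]

vocab = ["gold", "silver", "step", "truck"]
print(incidence_vector(["small", "step", "for", "mankind"], vocab))  # [0, 0, 1, 0]
```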

IR Models
Bag of Words Representation (contd.)

 Sometimes, N-gram model is also used to consider word order in
documents.
 As the bag-of-words model is an orderless document representation, only the counts of words matter.
 The n-gram model can be used to retain word-order information within the text.

 Example: a bigram bag-of-words model
 Doc1 = [“That’s one”, “one small”, “small step”, “step for”, “for a”, “a man”, “man a”, “a giant”, “giant leap”, “leap for”, “for mankind”]

IR Models
Bag of Words Representation (contd.)

 Pros:
 Simple set-theoretic representation of documents
 Efficient storage and retrieval of individual terms
 IR models using the bag of words representation work suitably well!

 Cons:
 Word order gets lost
 Very different documents could have similar representations
 Document structure (e.g. headings) and metadata is ignored

IR Models
Vector Space Model (Salton, 1968)

 represents documents and queries as vectors in the term space
 Term weights are used to compute the degree of similarity (or score)
between each document stored in the system and the user query

 Documents whose content (document terms) correspond most closely to
the content of the query (query terms) are judged to be the most
relevant.

 Fundamental premise for continuous relevance feedback mechanism.

IR Models
Vector Space Model (contd.)

 A document dj and a user query q are represented as t-dimensional vectors
 t is the total number of index terms in the system
 Each term corresponds to a basis vector of this t-dimensional space

 The query vector q is defined as q = (w1,q, w2,q , . . . ,wt,q)

 The vector for a document dj is represented by dj = (d1,j, d2,j , . . . ,dt,j)
 In the simplest case, wt,q and dt,j assume binary values in {0,1}
 1 if the term is present, 0 otherwise

IR Models
Vector Space Model (contd.)

 A document dj is represented as a document vector
  dj = Σ (i = 1 … N) wij · ti
 where ti = [0, 0, 0, …, 1, …, 0] is the unit vector of term i (1 in position i)

The degree of similarity of the document dj with regard to the query q is

the correlation between the vectors q and dj

IR Models
Vector Space Model (contd.)

Graphical Representation

Example:
 D1 = 2T1 + 3T2 + 5T3
 D2 = 3T1 + 7T2 + 1T3
 Q = 0T1 + 0T2 + 1T3

[Figure: D1, D2 and Q plotted in the 3-D term space T1, T2, T3]

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?

IR Models - Vector Space Model (contd.)
Evaluating vector similarity

 Idea: to use a measure of similarity between documents
 If d1 is near d2, then d2 is near d1
 If d1 is near d2, and d2 is near d3, then d1 is not far from d3
 No document is closer to d than d itself

 Euclidean distance
 Magnitude of difference vector between two vectors | d1 – d2|
 Problem of normalization

 The traditional method of determining similarity is to use the angle
between the compared vectors.

IR Models - Vector Space Model (contd.)
Evaluating vector similarity

 Cosine Similarity: the similarity of vectors q and di is captured by the cosine of the angle θ between them.

  SC(q, di) = cos(θ) = Σ (j = 1 … t) wqj · wij / ( √(Σ (j = 1 … t) wij²) · √(Σ (j = 1 … t) wqj²) )

[Figure: angles θ1 (between Q and D1) and θ2 (between Q and D2) in the term space t1, t2, t3]
IR Models - Vector Space Model (contd.)
Evaluating vector similarity

 An example –
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

 CosSim(D1, Q) = (2·0 + 3·0 + 5·2) / √((4+9+25)·(0+0+4)) = 10/√152 ≈ 0.81

 CosSim(D2, Q) = (3·0 + 7·0 + 1·2) / √((9+49+1)·(0+0+4)) = 2/√236 ≈ 0.13

D1 is about 6 times better a match than D2 using cosine similarity
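The same numbers can be reproduced in a few lines (a direct sketch of the cosine formula above):

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between vectors d and q."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cosine_similarity(D1, Q), 2))  # 0.81
print(round(cosine_similarity(D2, Q), 2))  # 0.13
```

Note that scaling Q (here 2T3 rather than 1T3) does not change the scores: the cosine depends only on the angle, not on vector length.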

IR Models - Vector Space Model (contd.)
Evaluating vector similarity

 Other similarity measures:
 Inner product
 Jaccard and Dice similarity
 Overlap coefficient
 Silhouette coefficient
 Conversion from a distance
 Kernel functions

IR Models - Vector Space Model (contd.)
Weighting Terms in VSM

 If a document talks more about a topic, then it is a better match.
 This applies even when we only have a single query term.
 Document relevant if it has many occurrences of the term(s)

 Idea: Assign to each term a weight which is proportional to its
importance both in the document and in the document collection.

IR Models - Vector Space Model (contd.)
Weighting Terms in VSM

 Term Weights: Term Frequency

 More frequent terms in a document are more important, i.e. more
indicative of the topic.
fi,d = frequency of term i in document d

 May want to normalize term frequency (tf) by dividing by the frequency
of the most common term in the document:
tfi,d = fi,d / maxi{fi,d}

IR Models - Vector Space Model (contd.)
Weighting Terms in VSM

 Term Weights: Inverse Document Frequency
 Terms that appear in many different documents are less indicative of
overall topic.
df i = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i,
= log2 (N/ df i)
where, N: total number of documents

 An indication of a term’s discrimination power.
 Log used to dampen the effect relative to tf.

IR Models - Vector Space Model (contd.)
Weighting Terms in VSM

 Term Frequency-Inverse Document Frequency (Tf-idf)

 Assign a tf.idf weight to each term i in each document d:
 wi,d = tfi,d × idfi

 tfi,d → frequency of term i in document d
 idfi → inverse document frequency of term i
 wi,d
 Increases with the number of occurrences within a doc
 Increases with the rarity of the term across the whole corpus
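A sketch of these weights, using the max-normalized tf from the earlier slide. (The slides use log2 in the idf definition but log10 in the worked example that follows; the base only rescales all weights uniformly, so base 10 is used here.)

```python
import math

def tf(term, doc_tokens):
    """Term frequency, normalized by the most frequent term in the document."""
    max_f = max(doc_tokens.count(t) for t in set(doc_tokens))
    return doc_tokens.count(term) / max_f

def idf(term, collection):
    """Inverse document frequency: log10(N / df)."""
    df = sum(1 for doc in collection if term in doc)
    return math.log10(len(collection) / df) if df else 0.0

def tf_idf(term, doc_tokens, collection):
    """Weight increases with in-document frequency and with corpus rarity."""
    return tf(term, doc_tokens) * idf(term, collection)

docs = [d.split() for d in (
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
)]
print(round(tf_idf("silver", docs[1], docs), 3))  # 0.477
```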

IR Models - Vector Space Model (contd.)
Weighting Terms in VSM

 Example (simplified)
 Query q = “gold silver truck”
 Document collection
 d1 = “Shipment of gold damaged in a fire”
 d2 = “Delivery of silver arrived in a silver truck”
 d3 = “Shipment of gold arrived in a truck”

 n = 3, t = 11
 If a term appears in only one document, its idf is log10(3/1) = 0.477
 If a term appears in two documents, its idf is log10(3/2) = 0.176
 If a term appears in all three documents, its idf is log10(3/3) = 0

IR Models - Vector Space Model (contd.)
Weighting Terms in VSM

idfa = idfin = idfof = 0
idfarrived = idfshipment = idftruck = idfgold = 0.176
idfdamaged = idfsilver = idfdelivery = idffire = 0.477
docid   a     arrived  damaged  delivery  fire   gold   in    of    shipment  silver  truck
d1      0     0        .477     0         .477   .176   0     0     .176      0       0
d2      0     .176     0        .477      0      0      0     0     0         .954    .176
d3      0     .176     0        0         0      .176   0     0     .176      0       .176
q       0     0        0        0         0      .176   0     0     0         .477    .176

SC (q, d1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) +
(0)(0) + (0)(0) + (0)(0.176) + (0.477)(0) + (0.176)(0) = 0.031
SC (q, d2) = (0.954)(0.477) + (0.176)(0.176) = 0.486
SC (q, d3) = (0.176)(0.176) + (0.176)(0.176) = 0.062
Final Ranking → d2 > d3 > d1
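The whole worked example can be reproduced with a short script (a sketch using raw term counts and base-10 idf, matching the numbers in the table above):

```python
import math

docs = {
    "d1": "shipment of gold damaged in a fire",
    "d2": "delivery of silver arrived in a silver truck",
    "d3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

tokenized = {d: text.split() for d, text in docs.items()}
N = len(docs)

def idf(term):
    """log10(N / df); 0 for terms occurring in every document (a, in, of)."""
    df = sum(1 for toks in tokenized.values() if term in toks)
    return math.log10(N / df) if df else 0.0

def weights(tokens):
    """Raw tf x idf weight for every distinct term in the token list."""
    return {t: tokens.count(t) * idf(t) for t in set(tokens)}

q_w = weights(query.split())
doc_w = {d: weights(toks) for d, toks in tokenized.items()}

# Score = inner product of query and document weight vectors
scores = {d: sum(q_w[t] * w.get(t, 0.0) for t in q_w) for d, w in doc_w.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d2', 'd3', 'd1']
```

The doubled “silver” in d2 (weight 2 × 0.477 = 0.954) is what pushes d2 clearly ahead of d3 and d1.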
Merits of Vector Space Models

 Simple, mathematically based approach.
 Considers both local (tf) and global (idf) word occurrence frequencies.
 Provides partial matching and ranked results.
 Tends to work quite well in practice despite obvious weaknesses.
 Allows efficient implementation for large document collections.

Problems with Vector Space Model

 Missing semantic information (e.g. word sense).
 Missing syntactic information (e.g. phrase structure, word order,
proximity information).
 Assumption of term independence (e.g. ignores synonymy).
 Lacks the control of a Boolean model (e.g., requiring a term to appear
in a document).
 Given a two-term query “A B”, may prefer a document containing A
frequently but not B, over a document that contains both A and B, but
both less frequently.

Further reading…

 Salton, Gerard, and Michael J. McGill. "Introduction to modern
information retrieval." (1986).

 Frakes, William B., and Ricardo Baeza-Yates. "Information retrieval:
data structures and algorithms." (1992).

 Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. Modern information
retrieval. Vol. 463. New York: ACM press, 1999.
