
UNIT 1: INTRODUCTION AND DATA PRE-PROCESSING

What is Information Retrieval?


• The meaning of the term information retrieval can be very broad.
• Just getting a credit card out of your wallet so that you can type in the card number
is a form of information retrieval.
• However, as an academic field of study, INFORMATION RETRIEVAL might be
defined thus:
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually
text) that satisfies an information need from within large collections (usually stored on computers).

What is Information Retrieval?
• Information retrieval is concerned with representing, searching, and manipulating large
collections of human-language data.
• It is the activity of obtaining resources relevant to an information need from a collection of
information resources.
• Process begins when user enters a query into the system.
• Queries are formal statements of information needs.
• User queries are matched against the database information. Depending on the application
the data objects may be text documents, images, audio, or video.
• Most IR systems compute a numeric score based on how well each object matches the
query and ranks objects based on this.
• The top ranking objects are then shown first. The process may be iterated if the user wishes
to refine the query.
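The query-scoring-ranking loop described above can be sketched in a few lines. This is a toy illustration, not any particular system's scoring function: it counts overlapping query terms, where real systems compute far more sophisticated scores. The sample documents are made up.

```python
# Toy IR loop: score each document against the query, rank by score.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

def score(query, text):
    """Number of distinct query terms that occur in the document."""
    q, d = set(query.split()), set(text.split())
    return len(q & d)

def rank(query, docs):
    """Return IDs of matching documents, best score first."""
    scored = [(score(query, text), doc_id) for doc_id, text in docs.items()]
    return [doc_id for s, doc_id in sorted(scored, reverse=True) if s > 0]

print(rank("home sales july", docs))  # → [3, 2, 1]
```

If the top-ranked documents are not satisfactory, the user refines the query and the loop repeats.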

Why IR?
• Every online database, every search engine, everything that is searched online is
based in some way or other on principles developed in IR.

Basic assumptions of IR
• Collection: fixed set of documents
• Goal: retrieve documents with information that is relevant to the user’s information
need and helps the user complete a task
• Using the Boolean Retrieval Model means that the information need must be
translated into a Boolean expression: n terms combined with AND, OR, and NOT
operators
• We want to support ad hoc retrieval: provide documents relevant to an arbitrary
user information need.

[Figure: IR at the intersection of NLP, databases (DB), and machine learning (ML)]
Types of Information Retrieval Systems
Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to
distinguish three prominent scales:
1. Web Search: The system has to provide search over billions of documents stored on millions of computers.
Distinctive issues are:
• Needing to gather documents for indexing.
• Being able to build systems that work efficiently at this enormous scale.
• Handling particular aspects of the web, such as the exploitation of hypertext.
• Not being fooled by site providers manipulating page content in an attempt to boost their search engine rankings, given the
commercial importance of the web.

2. Personal Information Retrieval: In the last few years, consumer operating systems have integrated
information retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search). Email programs
also provide text classification. They provide a spam (junk mail) filter, and commonly also provide either manual
or automatic means for classifying mail.
Distinctive issues are:
• Handling the broad range of document types on a typical personal computer
• Making the search system maintenance-free and sufficiently lightweight in terms of startup, processing, and
disk space usage that it can run on one machine without annoying its owner.

Types of Information Retrieval Systems
3. Enterprise, institutional, and domain-specific search: Retrieval might be
provided for collections such as a corporation’s internal documents, a database of
patents, or research articles on biochemistry. In this case, the documents will
typically be stored on centralized file systems and one or a handful of dedicated
machines will provide search over the collection.

An example information retrieval problem
• A fat book which many people own is Shakespeare’s Collected Works.
Suppose you wanted to determine which plays of Shakespeare contain the words Brutus
AND Caesar AND NOT Calpurnia.
• One way to do that is to start at the beginning and to read through all the text, noting for each
play whether it contains Brutus and Caesar and excluding it from consideration if it contains
Calpurnia.
• The simplest form of document retrieval is for a computer to do this sort of linear scan through
documents.
• This process is commonly referred to as GREP - after the Unix command grep, which performs
this process.
• Grepping through text can be a very effective process, especially given the speed of modern
computers, and often allows useful possibilities for wildcard pattern matching through the use of
regular expressions.
• With modern computers, for simple querying of modest collections (the size of Shakespeare’s
Collected Works is a bit under one million words of text in total), you really need nothing more.
An example information retrieval problem
However, a linear scan is not sufficient when we need:
1. To process large document collections quickly.
2. To allow more flexible matching operations.
3. To allow ranked retrieval.

Solution:
INDEX: The way to avoid linearly scanning the texts for each query is to index the
documents in advance.
Let us stick with Shakespeare’s Collected Works, and use it to introduce the basics
of the Boolean retrieval model.

Boolean Retrieval (Term Incidence Matrix)
• Suppose we record for each document – here a play of Shakespeare’s – whether it
contains each word out of all the words Shakespeare used (Shakespeare used
about 32,000 different words).
• The result is a binary term-document INCIDENCE MATRIX (refer next slide).
• TERMS are the indexed units. They are usually words, and for the moment you can
think of them as words, but the information retrieval literature normally speaks of
terms because some of them, such as perhaps I-9 or Hong Kong, are not usually
thought of as words.
• Now, depending on whether we look at the matrix rows or columns, we can have a
vector for each term, which shows the documents it appears in, or a vector for each
document, showing the terms that occur in it.
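A minimal sketch of this in Python, using sets of document names as the term rows of the incidence matrix (sets stand in for the 0/1 bit vectors; the play vocabularies are abbreviated to a few terms each):

```python
# Each play's text, reduced to a handful of terms for illustration.
docs = {
    "Antony and Cleopatra": "antony brutus caesar cleopatra mercy worser",
    "Julius Caesar": "antony brutus caesar calpurnia",
    "The Tempest": "mercy worser",
    "Hamlet": "brutus caesar mercy worser",
}

# term -> set of documents containing it (a row of the incidence matrix)
incidence = {}
for name, text in docs.items():
    for term in text.split():
        incidence.setdefault(term, set()).add(name)

all_docs = set(docs)
# Brutus AND Caesar AND NOT Calpurnia, as operations on the term rows
answer = (incidence["brutus"] & incidence["caesar"]
          & (all_docs - incidence["calpurnia"]))
print(sorted(answer))  # → ['Antony and Cleopatra', 'Hamlet']
```

With real bit vectors, the same query is a bitwise AND of the Brutus and Caesar rows with the complemented Calpurnia row.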

Boolean Retrieval (Term Incidence Matrix)

[Figures: the term-document incidence matrix example and Boolean query evaluation on its rows]
Moving on from Boolean Retrieval
• Our goal is to develop a system to address the ad hoc retrieval task.

In it, a system aims to provide documents from within the collection that are relevant to an
arbitrary user information need, communicated to the system by means of a one-off, user-
initiated query.

• INFORMATION NEED: topic about which the user desires to know more
• QUERY : what the user conveys to the computer in an attempt to communicate the
information need.
• RELEVANCE: A document is relevant if it is one that the user perceives as containing
information of value with respect to their personal information need. Our example above
was rather artificial in that the information need was defined in terms of particular words,
whereas usually a user is interested in a topic like “pipeline leaks” and would like to find
relevant documents regardless of whether they precisely use those words.
Effectiveness
• EFFECTIVENESS: To assess the effectiveness of an IR system (i.e., the quality of
its search results), a user will usually want to know two key statistics about the
system’s returned results for a query:
• PRECISION: What fraction of the returned results are relevant to the information need?
• RECALL: What fraction of the relevant documents in the collection were returned by the
system?
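These two fractions are straightforward to compute once we have the set of documents the system returned and the set judged relevant. The document IDs below are made up for illustration:

```python
# One query's results: what the system returned vs. what is relevant.
returned = {1, 2, 3, 4}
relevant = {2, 4, 5, 6}

true_positives = returned & relevant
precision = len(true_positives) / len(returned)  # fraction of returned that are relevant
recall = len(true_positives) / len(relevant)     # fraction of relevant that were returned

print(precision, recall)  # → 0.5 0.5
```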

Inverted Index
• INVERTED INDEX: The name is actually redundant: an index always maps back
from terms to the parts of a document where they occur.
• Inverted index, or sometimes inverted file, has become the standard term in
information retrieval.
• The basic idea of an inverted index is shown in Figure on next page
• DICTIONARY: We keep a dictionary of terms (sometimes also referred to as a
vocabulary or lexicon; we will use DICTIONARY for the data structure and
VOCABULARY for the set of terms). Then for each term, we have a list
that records which documents the term occurs in.
• Each item in the list – which records that a term appeared in a document (and,
later, often, the positions in the document) – is conventionally called a posting.
• POSTING: The list is then called a POSTINGS LIST (or inverted list), and all the
postings lists taken together are referred to as the POSTINGS.
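A minimal inverted index with these pieces, i.e. a dictionary mapping each term to a sorted postings list of document IDs, can be built in a few lines (the tiny collection is made up for illustration):

```python
# Tiny collection: docID -> text.
collection = {
    1: "it is what it is",
    2: "what is it",
    3: "it is a banana",
}

# Build term -> sorted postings list of docIDs.
index = {}
for doc_id, text in sorted(collection.items()):
    for term in text.split():
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:  # skip duplicate docIDs
            postings.append(doc_id)

print(index["it"])      # → [1, 2, 3]
print(index["banana"])  # → [3]
```

Processing documents in docID order keeps each postings list sorted without a final sort, which the Boolean query algorithms below rely on.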
Inverted Index

[Figures: the dictionary of terms and their postings lists]

Inverted Index Construction
Indexer steps: Tokenization
Indexer steps: Sort
Indexer steps: Dictionary and postings
Where do we pay in storage?

[Figures: the indexer pipeline and its storage costs; an exercise slide and its solution]
Exercise
Q. Consider these documents:
Doc 1 : breakthrough drug for schizophrenia
Doc 2 : new schizophrenia drug
Doc 3 : new approach for treatment of schizophrenia
Doc 4 : new hopes for schizophrenia patients
a. Draw the term-document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection.

Processing Boolean queries
• How do we process a query using an inverted index and the basic Boolean
retrieval model?
• Consider processing the simple conjunctive query: Brutus AND Calpurnia over the
inverted index partially shown in Figure earlier.
• We:
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Calpurnia in the Dictionary
4. Retrieve its postings
5. Intersect the two postings lists.
• The intersection operation is the crucial one: we need to efficiently intersect
postings lists so as to be able to quickly find documents that contain both
terms.
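The standard two-pointer intersection of sorted postings lists can be written directly from the steps above. The postings values follow the Brutus and Calpurnia lists used in the text's example:

```python
# Intersect two sorted postings lists in O(len(p1) + len(p2)) comparisons.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # → [2, 31]
```

Because both lists are sorted by docID, each pointer only moves forward, which is why a single pass suffices.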

Processing Boolean queries

[Figures: the postings-list intersection algorithm and worked examples]
Faster postings merges: Skip pointers/Skip lists

[Figures: postings lists augmented with skip pointers and the skip-aware intersection algorithm]
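A sketch of intersection with skip pointers, assuming the common heuristic of placing a skip roughly every √L positions in a list of length L (the postings values are made up):

```python
import math

def skips(postings):
    """Map position -> position it skips to, a skip every ~sqrt(L) entries."""
    step = int(math.sqrt(len(postings))) or 1
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = skips(p1), skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only when it cannot jump past a match.
            if i in s1 and p1[s1[i]] <= p2[j]:
                i = s1[i]
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                j = s2[j]
            else:
                j += 1
    return answer

print(intersect_with_skips(list(range(0, 100, 2)), [8, 41, 64, 128]))  # → [8, 64]
```

The skip is safe because every docID it jumps over is strictly smaller than the other list's current docID, so no match can be missed.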
Pre-processing steps: Tokenization
Definitions
• Word – A delimited string of characters as it appears in the text.
• Term – A “normalized” word (case, morphology, spelling, etc.); an equivalence class
of words.
• Token – An instance of a word or term occurring in a document.
• Type – The same as a term in most cases: an equivalence class of tokens.

Normalization
• Need to “normalize” words in indexed text as well as query terms into the same
form.
• Example: We want to match U.S.A. and USA. We most commonly implicitly define
equivalence classes of terms.
• Alternatively: do asymmetric expansion
• window → window, windows
• windows → Windows, window
• Windows (no expansion)
• More powerful, but less efficient
• Why don’t you want to put window, Window, windows, and Windows in the same
equivalence class?
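A toy version of the asymmetric expansion table from the bullets above. The mapping itself is illustrative, not a real system's:

```python
# Asymmetric expansion: each query term expands to a hand-chosen list of
# index terms; unlisted terms match only themselves.
expansions = {
    "window": ["window", "windows"],
    "windows": ["Windows", "window"],
    "Windows": ["Windows"],  # no expansion: likely the product name
}

def expand_query_term(term):
    return expansions.get(term, [term])

print(expand_query_term("window"))   # → ['window', 'windows']
print(expand_query_term("Windows"))  # → ['Windows']
```

Unlike symmetric equivalence classes, the expansion can differ per direction, which is what makes it more powerful but also more expensive at query time.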

Normalization: Other languages
• Normalization and language detection interact.
• PETER WILL NICHT MIT. (German: “Peter doesn’t want to come along”; mit = “with”) → MIT != mit
• He got his PhD from MIT. → MIT = mit

Tokenization: Recall construction of inverted index
• Each token is a candidate for a postings entry.
• What are valid tokens to emit?

Exercises
• In June, the dog likes to chase the cat in the barn. – How many word tokens? How
many word types?
• Why is tokenization difficult, even in English? Tokenize: Mr. O’Neill thinks that the
boys’ stories about Chile’s capital aren’t amusing.

Tokenization problems: One word or two? (or several)

• Hewlett-Packard
• State-of-the-art
• co-education
• the hold-him-back-and-drag-him-away maneuver
• data base
• San Francisco
• Los Angeles-based company
• cheap San Francisco-Los Angeles fares
• York University vs. New York University

Numbers
• 3/20/91
• 20/3/91
• Mar 20, 1991
• B-52
• 100.2.86.144
• (800) 234-2333
• 800.234.2333
• Older IR systems may not index numbers, but generally it’s a useful feature.
• Google example

Ambiguous segmentation in Chinese

[Figure: a Chinese character sequence with two alternative word segmentations]
Case folding
• Reduce all letters to lower case.
• Even though case can be semantically meaningful
• capitalized words in mid-sentence
• MIT vs. mit
• Fed vs. fed
• It’s often best to lowercase everything, since users will use lowercase
regardless of correct capitalization.

Stop words
• Stop words = extremely common words which would appear to be of little value in
helping select documents matching a user need.
• Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that,
the, to, was, were, will, with
• Stop word elimination used to be standard in older IR systems.
• But you need stop words for phrase queries, e.g. “King of Denmark”
• Most web search engines index stop words.
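A sketch of stop word elimination using the stop list from the slide; note how it would also delete “of” from a phrase query like “King of Denmark”, which is exactly why phrase queries need stop words kept:

```python
# The stop list from the slide.
stop_words = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
              "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
              "to", "was", "were", "will", "with"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words("king of denmark".split()))  # → ['king', 'denmark']
```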

Lemmatization
• Reduce inflectional/variant forms to base form.
• Example: am, are, is → be
• Example: car, cars, car’s, cars’ → car
• Example: the boy’s cars are different colors → the boy car be different color
• Lemmatization implies doing “proper” reduction to dictionary headword form (the
lemma).
• Inflectional morphology (cutting → cut) vs. derivational morphology (destruction →
destroy)

Stemming
• Definition of stemming: Crude heuristic process that chops off the ends of words in
the hope of achieving what “principled” lemmatization attempts to do with a lot of
linguistic knowledge.
• Language dependent
• Often inflectional and derivational
• Example for derivational: automate, automatic, automation all reduce to automat

Porter algorithm
• Most common algorithm for stemming English
• Results suggest that it is at least as good as other stemming options
• Conventions + 5 phases of reductions
• Phases are applied sequentially
• Each phase consists of a set of commands.
• Sample command: Delete final ement if what remains is longer than 1 character
• replacement → replac
• cement → cement
• Sample convention: Of the rules in a compound command, select the one that
applies to the longest suffix.
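The sample command above is easy to express directly. This is just that one rule, not the full Porter stemmer:

```python
# Porter sample command: delete a final "ement" if what remains is
# longer than 1 character.
def strip_ement(word):
    if word.endswith("ement") and len(word) - len("ement") > 1:
        return word[:-len("ement")]
    return word

print(strip_ement("replacement"))  # → replac
print(strip_ement("cement"))       # → cement ("c" is too short a remainder)
```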

Three stemmers: A comparison
• Sample text: Such an analysis can reveal features that are not easily visible from
the variations in the individual genes and can lead to a picture of expression that is
more biologically transparent and accessible to interpretation
• Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the
variat in the individu gene and can lead to a pictur of express that is more biolog
transpar and access to interpret
• Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th
individu gen and can lead to a pictur of expres that is mor biolog transpar and
acces to interpres
• Paice stemmer: such an analys can rev feat that are not easy vis from the vary in
the individ gen and can lead to a pict of express that is mor biolog transp and
access to interpret

Does stemming improve effectiveness?
• In general, stemming increases effectiveness for some queries, and decreases
effectiveness for others.
• Queries where stemming is likely to help:
• [tartan sweaters], [sightseeing tour san francisco]
• (equivalence classes: {sweater,sweaters}, {tour,tours})
• Porter Stemmer equivalence class oper contains all of operate operating operates
operation operative operatives operational.
• Queries where stemming hurts: [operational AND research], [operating AND
system], [operative AND dentistry]

Exercise: What does Google do?
• Stop words
• Normalization
• Tokenization
• Lowercasing
• Stemming
• Non-latin alphabets
• Umlauts
• Compounds
• Numbers

Part of Speech Tagging
• Tagging is a kind of classification: the automatic assignment of descriptors to
tokens.
• Here the descriptor is called a tag, which may represent a part of speech,
semantic information, and so on.
• Part-of-Speech (PoS) tagging, then, may be defined as the process of assigning
one of the parts of speech to a given word. It is generally called POS tagging.
• In simple words, POS tagging is the task of labelling each word in a sentence with
its appropriate part of speech.
• We already know that parts of speech include nouns, verbs, adverbs, adjectives,
pronouns, conjunctions, and their sub-categories.

Part of Speech Tagging
Steps Involved:
• Tokenize text
• Assign POS Tags

Part of Speech Tagging
Abbreviation – Meaning
CC – coordinating conjunction
CD – cardinal digit
DT – determiner
EX – existential there
FW – foreign word
IN – preposition/subordinating conjunction
JJ – adjective (large)
JJR – adjective, comparative (larger)
JJS – adjective, superlative (largest)
LS – list marker
MD – modal (could, will)
NN – noun, singular (cat, tree)
NNS – noun, plural (desks)
NNP – proper noun, singular (sarah)
NNPS – proper noun, plural (indians or americans)
PDT – predeterminer (all, both, half)
POS – possessive ending (parent's)
PRP – personal pronoun (hers, herself, him, himself)
PRP$ – possessive pronoun (her, his, mine, my, our)
RB – adverb (occasionally, swiftly)
RBR – adverb, comparative (greater)
RBS – adverb, superlative (biggest)
RP – particle (about)
TO – infinitive marker (to)
UH – interjection (goodbye)
VB – verb (ask)
VBG – verb gerund (judging)
VBD – verb past tense (pleaded)
VBN – verb past participle (reunified)
VBP – verb, present tense, not 3rd person singular (wrap)
VBZ – verb, present tense, 3rd person singular (bases)
WDT – wh-determiner (that, what)
WP – wh-pronoun (who)
WRB – wh-adverb (how)

Wildcard Queries
• Wildcard queries are used in any of the following situations:
(1) the user is uncertain of the spelling of a query term (e.g., Sydney vs. Sidney, which leads to the
wildcard query S*dney);
(2) the user is aware of multiple variants of spelling a term and (consciously) seeks documents
containing any of the variants (e.g., color vs. colour);
(3) the user seeks documents containing variants of a term that would be caught by stemming, but is
unsure whether the search engine performs stemming (e.g., judicial vs. judiciary, leading to the
wildcard query judicia*);
(4) the user is uncertain of the correct rendition of a foreign word or phrase (e.g., the query
Universit* Stuttgart).
• A query such as mon* is known as a trailing wildcard query , because the * symbol occurs only once, at the end
of the search string.
• A search tree on the dictionary is a convenient way of handling trailing wildcard queries: we walk down the tree
following the symbols m, o, and n in turn, at which point we can enumerate the set W of terms in the
dictionary with the prefix mon. Finally, we use |W| lookups on the standard inverted index to retrieve all
documents containing any term in W.
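A sketch of trailing-wildcard processing, with a sorted term list and a binary search standing in for the walk down the search tree, and made-up postings:

```python
from bisect import bisect_left

# Sorted dictionary stands in for the search tree; postings are illustrative.
dictionary = sorted(["moan", "mon", "monday", "money", "month", "moon"])
postings = {"moan": [4], "mon": [2], "monday": [1, 5], "money": [3],
            "month": [3, 7], "moon": [6]}

def prefix_terms(prefix):
    """Enumerate the set W of dictionary terms with the given prefix."""
    lo = bisect_left(dictionary, prefix)  # first term >= prefix
    terms = []
    for term in dictionary[lo:]:
        if not term.startswith(prefix):
            break
        terms.append(term)
    return terms

def wildcard_trailing(prefix):
    """Answer prefix* by |W| lookups on the inverted index."""
    docs = set()
    for term in prefix_terms(prefix):
        docs.update(postings[term])
    return sorted(docs)

print(prefix_terms("mon"))       # → ['mon', 'monday', 'money', 'month']
print(wildcard_trailing("mon"))  # → [1, 2, 3, 5, 7]
```

The binary search works because all terms sharing a prefix form a contiguous run in a sorted dictionary, just as they share a subtree in the search tree.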

Wildcard Queries
• But what about wildcard queries in which the * symbol is not constrained to be at the end of the search string?
• Before handling this general case, we mention a slight generalization of trailing wildcard queries.
• First, consider leading wildcard queries, or queries of the form *mon.
• Consider a reverse B-tree on the dictionary - one in which each root-to-leaf path of the B-tree corresponds to a term in the
dictionary written backwards: thus, the term lemon would, in the B-tree, be represented by the path root-n-o-m-e-l.
• A walk down the reverse B-tree then enumerates all terms R in the vocabulary ending with a given suffix.
• In fact, using a regular B-tree together with a reverse B-tree, we can handle an even more general case: wildcard queries in
which there is a single * symbol, such as se*mon.
• To do this, we use the regular B-tree to enumerate the set W of dictionary terms beginning with the prefix se, then the
reverse B-tree to enumerate the set R of terms ending with the suffix mon.
• Next, we take the intersection W ∩ R of these two sets, to arrive at the set of terms that begin with the prefix se and
end with the suffix mon.
• Finally, we use the standard inverted index to retrieve all documents containing any terms in this intersection.
We can thus handle wildcard queries that contain a single * symbol using two B-trees, the normal B-tree and a reverse B-tree.
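A sketch of the two-B-tree scheme, with simple prefix and suffix scans over an illustrative vocabulary standing in for the two B-tree walks ("seamon" is a made-up term):

```python
# Illustrative vocabulary; a real system would hold millions of terms.
vocabulary = {"salmon", "seamon", "sermon", "session", "summon"}

def wildcard_single_star(query):
    """Answer a query with a single *, e.g. se*mon."""
    prefix, suffix = query.split("*")
    w = {t for t in vocabulary if t.startswith(prefix)}  # regular B-tree walk
    r = {t for t in vocabulary if t.endswith(suffix)}    # reverse B-tree walk
    return sorted(w & r)  # terms matching both prefix and suffix

print(wildcard_single_star("se*mon"))  # → ['seamon', 'sermon']
```

Each matching term's postings list would then be fetched from the standard inverted index, exactly as in the trailing-wildcard case.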
