
Chapter Two

Text Operations
Statistical Properties of Text
How is the frequency of different words distributed?

How fast does vocabulary size grow with the size of a corpus?

Such factors affect the performance of an IR system and can be used to
select suitable term weights and other aspects of the system.

A few words are very common.

The two most frequent words (e.g. “the”, “of”) can account for about 10%
of word occurrences.

Most words are very rare.

Half the words in a corpus appear only once; such words are called hapax legomena (Greek for “read only once”)
Sample Word Frequency Data

 The table shows the most frequently occurring words from a 336,310-document collection containing 125,720,891 total words, of which 508,209 are unique words
Zipf’s distributions
Rank Frequency Distribution
For all the words in a collection of documents, for each word w
• f : is the frequency that w appears
• r : is rank of w in order of frequency.
• (The most commonly occurring word has rank 1, etc.)

[Figure: distribution of sorted word frequencies according to Zipf’s law; a word w with rank r (x-axis) has frequency f (y-axis)]
Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
attempts to capture the distribution of the frequencies (i.e.,
numbers of occurrences) of the words within a text.
 Zipf's Law states that when the distinct words in a text are
arranged in decreasing order of their frequency of
occurrence (most frequent words first), the occurrence
characteristics of the vocabulary can be characterized by
the constant rank-frequency law of Zipf:
frequency × rank ≈ constant, i.e., f · r ≈ C
• Zipf’s Law Impact on IR
Stop words will account for a large fraction of
text, so eliminating them greatly reduces
inverted-index storage costs.
More Examples: Zipf’s Law
 Illustration of the rank-frequency law. Let the total number of word
occurrences in the sample be N = 1,000,000

Rank (R)   Term   Frequency (F)   R·(F/N)
   1       the        69,971       0.070
   2       of         36,411       0.073
   3       and        28,852       0.086
   4       to         26,149       0.104
   5       a          23,237       0.116
   6       in         21,341       0.128
   7       that       10,595       0.074
   8       is         10,099       0.081
   9       was         9,816       0.088
  10       he          9,543       0.095
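The rank-frequency law above can be checked directly. The following sketch (values copied from the table, N = 1,000,000) computes R·(F/N) for each rank and shows that the product stays roughly constant:

```python
# Verify Zipf's rank-frequency law: r * (f / N) stays roughly constant.
# Frequencies are the sample values from the table above.
freqs = {
    1: 69_971, 2: 36_411, 3: 28_852, 4: 26_149, 5: 23_237,
    6: 21_341, 7: 10_595, 8: 10_099, 9: 9_816, 10: 9_543,
}
N = 1_000_000  # total word occurrences in the sample

for rank, f in freqs.items():
    c = rank * f / N
    print(f"rank {rank:2d}: f = {f:6d}, r*(f/N) = {c:.3f}")
```

All products fall in a narrow band (roughly 0.07–0.13), which is the empirical content of Zipf's law.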
Methods that Build on Zipf's Law

• Stop lists: Ignore the most frequent words


(upper cut-off). Used by almost all systems.
• Significant words: Take words in between the
most frequent (upper cut-off) and least frequent
words (lower cut-off).
• Term weighting: Give differing weights to
terms based on their frequency, with the most
frequent words weighted less.
• Used by almost all ranking methods.
Word significance: Luhn’s Ideas
• Luhn Idea (1958): the frequency of word occurrence in a text

furnishes a useful measurement of word significance.

• Luhn suggested that both extremely common and extremely

uncommon words were not very useful for indexing.

• For this, Luhn specifies two cutoff points: an upper and a lower

cutoffs based on which non-significant words are excluded


– The words exceeding/above the upper cutoff were considered to be

common

– The words below the lower cutoff were considered to be rare

– Hence they are not contributing significantly to the content of the

text

– The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cutoffs

Vocabulary size: Heaps’ Law
Heaps’ law estimates the number of distinct words
(the vocabulary size) in a given corpus

How does the size of the overall vocabulary

grow with the size of the corpus?


This determines how the size of the inverted
index will scale with the size of the corpus.


Heaps’ distribution
• Distribution of the size of the vocabulary: vocabulary size V
grows with the number of tokens n as V = K · n^β (constants K and β,
with 0 < β < 1), i.e., a linear relationship on a log-log plot

Example: from 1,000,000,000 documents, there


may be 1,000,000 distinct words. Can you agree?
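A minimal sketch of Heaps' law makes the sub-linear growth concrete. The constants below (K = 44, β = 0.49) are illustrative only; real values are corpus-dependent and must be fitted:

```python
# Heaps' law sketch: V(n) = K * n**beta estimates vocabulary size V
# from the number of tokens n. K and beta vary by corpus; the values
# here (K = 44, beta = 0.49) are typical ballpark figures for English
# text, used purely for illustration.
def heaps_vocabulary(n_tokens, K=44, beta=0.49):
    return int(K * n_tokens ** beta)

v_small = heaps_vocabulary(1_000_000)      # vocabulary of a 1M-token corpus
v_large = heaps_vocabulary(100_000_000)    # 100x more tokens...
print(v_small, v_large)                    # ...but far less than 100x vocabulary
```

The point of the sketch: multiplying the corpus size by 100 grows the vocabulary by roughly a factor of 10, not 100, which is why inverted-index size stays manageable.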
Text Operations
 Not all words in a document are equally significant to represent the

contents/meanings of a document

 Some words carry more meaning than others

 Nouns are the most representative of a document’s content

 Therefore, we need to preprocess the text of the documents in a collection

to obtain the terms to be used as index terms

 Using the set of all words in a collection to index documents creates

too much noise for the retrieval task

 Reducing noise means reducing the number of words that can be used to refer to the document
Preprocessing is the process of controlling the size of the

vocabulary or the number of distinct words used as index

terms

 Preprocessing will lead to an improvement in the information

retrieval performance

However, some search engines on the Web omit

preprocessing

 Every word in the document is an index term


Document snippet
 “One evening Frodo and Sam were walking together in the cool

twilight. Both of them felt restless again. On Frodo suddenly the


shadow of parting had falling: the time to leave Lothlorien was near. ”
  Unstructured (bag-of-words) representation

 {(One, 1), (evening, 1), (Frodo, 2), (and, 2), (Sam, 1) (were, 1),

(walking, 1), (together, 1), (in, 1), (the, 3), (cool, 1), (twilight, 1),
(Both, 1), (of, 2), (them, 1), (felt,1), (restless, 1), (again, 1), (On, 1),
(suddenly, 1), (shadow, 1), (parting, 1), (had, 1), (falling, 1), (time, 1),
(to, 1), (leave, 1), (Lothlorien, 1), (was, 1), (near, 1)}
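The bag-of-words representation above can be produced with a few lines of code. This is a minimal sketch using whitespace splitting and punctuation stripping; exact counts can differ slightly from the list above depending on how punctuation and case are handled:

```python
from collections import Counter
import string

# Bag-of-words sketch: split the snippet on whitespace, strip leading/
# trailing punctuation from each token, and count occurrences.
text = ("One evening Frodo and Sam were walking together in the cool "
        "twilight. Both of them felt restless again. On Frodo suddenly the "
        "shadow of parting had falling: the time to leave Lothlorien was near.")

tokens = [w.strip(string.punctuation) for w in text.split()]
bag = Counter(t for t in tokens if t)

print(bag["the"])    # 3
print(bag["Frodo"])  # 2
```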
Weakly-structured representations
Bag of nouns

{(evening, 1), (Frodo, 2), (Sam, 1), (twilight, 1),

(shadow, 1), (parting, 1), (time, 1), (Lothlorien,


1)}
 Bag of named entities

{(Frodo, 2), (Sam, 1), (Lothlorien, 1)}


Text Operations
• Text operations are the process of transforming text into
logical representations
• 4 main operations for selecting index terms, i.e. to choose
words/stems (or groups of words) to be used as indexing
terms:
– Lexical analysis/tokenization of the text – handling digits, hyphens,
punctuation marks, and the case of letters
– Elimination of stop words - filter out words which are not useful
in the retrieval process
– Stemming words - remove affixes (prefixes and suffixes)
– Construction of term categorization structures such as thesaurus,
to capture relationship for allowing the expansion of the original
query with related terms
Generating Document Representatives
• Text Processing System
– Input text – full text, abstract or title
– Output – a document representative adequate for use in an
automatic retrieval system

• A document will be indexed by a name if one of its significant


words occurs as a member of that class.

documents → Tokenization → stop-word removal → stemming → Thesaurus → index terms
Lexical Analysis/Tokenization of
Text
Change text of the documents into words to be adopted as
index terms
Objective - identify words in the text ( Digits, hyphens,

punctuation marks, case of letters)


 Numbers are not good index terms (like 1910, 1999)
 Hyphens – break up the words (e.g. state-of-the-art → state of the art),
but some words, e.g. gilt-edged, B-49, are unique words which require
hyphens
 Punctuation marks – remove totally unless significant, e.g. in program
code: x.exe vs. xexe
 Case of letters – not important; can convert all to upper or lower case
Word is a delimited string of characters as it appears in
the text
 Term is a normalized form of the word (accounting for
morphology, spelling, etc.)
Word and term are in the same equivalence class – in
informal speech they are often used interchangeably
 Token is an instance of a word or term occurring in a
document
 Tokens are “words” in the general sense
 But numbers, punctuation, and special characters are also
tokens
Tokenization is a process, typically automated, of
breaking down the text (one long string) into a sequence
of tokens (shorter strings)
Tokenization
Analyze text into a sequence of discrete tokens (words).
Input: “Friends, Romans and Countrymen”
Output: Tokens (an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
Friends
Romans
and
Countrymen
Each such token is now a candidate for an index entry,
after further processing
 Simplest approach is to ignore all numbers and
punctuation and use only case-insensitive unbroken
strings of alphabetic characters as tokens.
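The “simplest approach” above is easy to implement. This sketch keeps only unbroken alphabetic strings and lowercases them, discarding numbers and punctuation (so, e.g., “It’s” becomes two tokens):

```python
import re

# Minimal tokenizer for the simplest approach described above:
# case-insensitive unbroken strings of alphabetic characters only.
def tokenize(text):
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']
```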
Exercise: Tokenization
The cat slept peacefully in the living room. It’s a very old
cat.

Mr. O’Neill thinks that the boys’ stories about Chile’s


capital aren’t amusing.
Exercise on Issues in Tokenization
• One word or multiple: How do you decide it is one token or two

or more?
– Hewlett-Packard → Hewlett and Packard as two tokens?
• state-of-the-art: break up hyphenated sequence.

• San Francisco, Los Angeles

– lowercase, lower-case, lower case ?


• data base, database, data-base

• Numbers:
• dates (3/12/91 vs. Mar. 12, 1991);

• phone numbers,

• IP addresses (100.2.86.144)
 Stop words are semantically poor terms such as articles, prepositions,
conjunctions, pronouns, etc.

 Removal of stop words is one of the most common steps of IR text preprocessing

 Q: Why would we want to remove the stop words?

 A: Because stop words have very little meaning, they do not determine

whether a document is relevant or not

 A: Removing stop words reduces the size of vocabulary (and index) and

makes retrieval process more efficient

 A: Including stop words may lead to false positives because of stop word

matches between query and documents


Retrieval of documents may result in:
False negative (false drop): some relevant documents may not be
retrieved.
False positive: some irrelevant documents may be retrieved.
For many applications a good index should not permit any false
drops, but may permit a few false positives.

                   relevant                    irrelevant
retrieved          A                           B – “false positives”
                                                   (“Type I errors”, “errors of commission”)
not retrieved      C – “false negatives”       D
                       (“Type II errors”, “errors of omission”)

• These counts underlie the metrics often used to evaluate the effectiveness of the system
Elimination of STOPWORD
• Stopwords are extremely common words across document
collections that have no discriminatory power
– They may occur in 80% of the documents in a collection.
– They appear to be of little value in helping select documents
matching a user need and need to be filtered out as potential index
terms
• Examples of stop words are articles, prepositions, conjunctions,
etc.:
– articles (a, an, the); pronouns: (I, he, she, it, their, his)
– Some prepositions (on, of, in, about, besides, against),
– conjunctions/ connectors (and, but, for, nor, or, so, yet),
– verbs (is, are, was, were),
– adverbs (here, there, out, because, soon, after) and
– adjectives (all, any, each, every, few, many, some) can also be treated
as stopwords
How to determine a list of stop words?
• One method: Sort terms (in decreasing order) by

collection frequency and take the most frequent ones

• Another method: Build a stop word list that contains

a set of articles, pronouns, etc.

– Why do we need stop lists: to exclude the commonest words

entirely from the index terms
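The first method above (sort by collection frequency, take the top terms) can be sketched in a few lines. The tiny collection below is illustrative only:

```python
from collections import Counter

# Sketch of the frequency-based method: count terms over the whole
# collection and take the k most frequent as the stop list.
def build_stop_list(documents, k=3):
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(k)}

docs = ["the cat sat on the mat",
        "the dog and the cat",
        "a dog on a mat"]
print(build_stop_list(docs))  # "the" is the most frequent term
```

In practice the cut-off k (or a frequency threshold) must be tuned; a fixed word list (articles, pronouns, etc.) is the alternative second method.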


Stop words
• Stop word elimination used to be standard in older IR

systems.

• But the trend is getting away from doing this.

• Most web search engines index stop words:

– Good query optimization techniques mean you pay little at query

time for including stop words.

– Elimination of stop words might reduce recall


Normalization issues
• Normalization is the process of putting the
incoming items/tokens to a standard format
• Case-folding – converting all letters to lower case

• Need to “normalize” terms in indexed text as well as query terms into

the same form

Often best to lower case everything (queries and


documents)
• In IR, lowercasing is most practical because of the way

users issue their queries


Normalization
• Case Folding: Often best to lower case everything, since users will use

lowercase regardless of ‘correct’ capitalization. Example: We want to


match
– Republican vs. republican

– Fasil vs. fasil vs. FASIL

– Anti-discriminatory vs. antidiscriminatory

– Car vs. automobile?

– Example: We want to match U.S.A. and USA,

How does Google do it?

 “C.A.T.” (information need: Caterpillar Inc.)

 returns cat (the animal) as the first result


Stemming/Morphological analysis
• Stems : the base of a word, to which affixes are added.

 Reducing different forms of the same word to a common

representative form
• Stemming reduces tokens to their “root” form of words to recognize

morphological variation .
– The process involves removal of affixes (i.e. prefixes and suffixes)

with the aim of reducing variation to the same stem


– e.g.: rewrite, postgraduate, unreliable, carefully, blindness

• Stemming is language dependent


 Stemming is the procedure of reducing the word to its grammatical root

 The result of stemming is not necessarily a valid word of the language

 E.g., “recognized” -> “recogniz”, “incredibly” -> “incredibl”

 Stemming removes suffixes with heuristics

 E.g., “automates”, “automatic”, “automation” will all be reduced to
“automat”

 Normalization reduces all words syntactically derived from some word to
the original word (even if the derived word has a different meaning)
 E.g., “destruction” -> “destroy”,
“houses” -> “house”, “tried” -> “try”


Stemming
• The final output from a conflation (reducing to the same token)

algorithm is a set of classes, one for each stem detected.

–A Stem: the portion of a word which is left after the removal of its

affixes (i.e., prefixes and/or suffixes).

–Example: ‘connect’ is the stem for {connected, connecting,
connection, connections}

–Thus, [automate, automatic, automation] all reduce to →
automat
A class name is assigned to a document if and only if one of its

members occurs as a significant word in the text of the document.


– A document representative then becomes a list of class names,

which are often referred as the documents index terms/keywords.

There are basically two ways to implement stemming.

1. The first approach is to create a big dictionary that maps words to

their stems.
2. The second approach is to use a set of rules (an algorithm) that
extracts stems from words


Porter Stemmer
• Stemming is the operation of stripping the suffixes from a
word, leaving its stem.

– Google, for instance, uses stemming to search for web pages containing

the words connected, connecting, connection and connections when

users ask for a web page that contains the word connect.

• In 1979, Martin Porter developed a stemming algorithm that
uses a set of rules to extract stems from words.


Porter stemmer
• Most common algorithm for stemming English words to their
common grammatical root
• It is simple procedure for removing known affixes in English
without using a dictionary.
• To get rid of plurals and common suffixes, rules such as the
following are used:
– SSES → SS      caresses → caress
– IES → I        ponies → poni
– SS → SS        caress → caress
– S → (null)     cats → cat
– EMENT → (null) (delete final “ement” if what remains is longer than 1
character)
                 replacement → replac
While step 1a gets rid of plurals, step 1b removes -ed or
-ing, e.g.:
agreed -> agree        disabled -> disable
matting -> mat         mating -> mate
meeting -> meet        milling -> mill
messing -> mess        meetings -> meet
feed -> feed
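The step 1a rules listed above translate almost directly into code. This is a sketch of step 1a only (plural removal), not the full Porter algorithm, which has many more steps and conditions:

```python
# Sketch of Porter step 1a (plural removal) following the SSES/IES/SS/S
# rules listed above. The full algorithm adds measure conditions and
# further steps (1b, 1c, 2-5) not shown here.
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS:  caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # IES -> I:    ponies -> poni
    if word.endswith("ss"):
        return word        # SS -> SS:    caress -> caress
    if word.endswith("s"):
        return word[:-1]   # S -> null:   cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that rule order matters: the longer suffixes must be tested first, otherwise “caresses” would match the bare `s` rule.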
Stemming: challenges
May produce unusual stems that are not English words:

Removing ‘UAL’ from FACTUAL and EQUAL

May conflate (reduce to the same token) words that are
actually distinct

 “computer”, “computational”, “computation” all
reduced to same token “comput”
Thesauri
• It lists words in groups of synonyms and related
concepts.
• Full-text searching alone often cannot be accurate, since
different authors may select different words to
represent the same concept
– Problem: The same meaning can be expressed using
different terms that are synonyms, homonyms, and related
terms
– How can it be achieved such that for the same meaning the
identical terms are used in the index and the query?
Homonyms: each of two or more words having the same
spelling or pronunciation but different meanings and origins.
E.g.: right (direction) vs. right (correct); band (music group) vs. band (strip).
Aim of Thesaurus
• Thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms
• The aim of thesaurus is therefore:
– to provide a standard vocabulary for indexing and searching
• A thesaurus rewrites terms to form equivalence classes, and we index
such equivalence classes
• When the document contains automobile, index it under car as well
– to assist users with locating terms for proper query
formulation: When the query contains automobile, look
under car as well for expanding query
– to provide classified hierarchies that allow the broadening
and narrowing of the current request according to user needs

e.g.: car = automobile, truck, bus, taxi, motor vehicle
      color = colour, paint
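Query expansion with such equivalence classes can be sketched in a few lines. The tiny thesaurus below uses the classes from the example above and is illustrative only:

```python
# Thesaurus-based query expansion sketch: each query term is replaced
# by its whole equivalence class, so a query for "automobile" also
# matches documents indexed under "car". Toy thesaurus for illustration.
THESAURUS = {
    "car": {"car", "automobile", "truck", "bus", "taxi", "motor vehicle"},
    "color": {"color", "colour", "paint"},
}

def expand_query(terms):
    expanded = set()
    for t in terms:
        for synonyms in THESAURUS.values():
            if t in synonyms:
                expanded |= synonyms   # add the whole equivalence class
                break
        else:
            expanded.add(t)            # no class found: keep the term as-is
    return expanded

print(expand_query(["automobile"]))
```

The same table can be used at indexing time instead (index “automobile” under “car”); either way, query and index end up using identical terms for the same meaning.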
Questions
Which search engines use text preprocessing?

Why, or why not? Does text preprocessing have an

impact on users’ needs, the IR system, etc.? If so,

how? If not, why not?

Submission date: 09/04/14 E.C.


Questions?
