
Chapter Two

Text Operations
Statistical Properties of Text
How is the frequency of different words distributed?

How fast does vocabulary size grow with the size of a corpus?

Such factors affect the performance of an IR system and can be used to
select suitable term weights and other aspects of the system.

A few words are very common.

The two most frequent words (e.g. “the”, “of”) can account for about 10%
of word occurrences.

Most words are very rare.

Half the words in a corpus appear only once; such words are called hapax legomena (Greek for “read only once”)
Sample Word Frequency Data

 The table shows the most frequently occurring words from a 336,310-document collection containing 125,720,891 total words, of which 508,209 are unique words
Zipf’s distributions
Rank Frequency Distribution
For all the words in a collection of documents, for each word w
• f : is the frequency that w appears
• r : is rank of w in order of frequency.
• (The most commonly occurring word has rank 1, etc.)

[Figure: distribution of sorted word frequencies according to Zipf’s law; a word w with rank r (x-axis) has frequency f (y-axis)]
Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
attempts to capture the distribution of the frequencies (i.e.,
numbers of occurrences) of the words within a text.
 Zipf's Law states that when the distinct words in a text are
arranged in decreasing order of their frequency of
occurrence (most frequent words first), the occurrence
characteristics of the vocabulary can be characterized by
the constant rank-frequency law of Zipf:
frequency × rank ≈ constant, i.e., f · r ≈ C
• Zipf’s Law Impact on IR
Stop words will account for a large fraction of
text, so eliminating them greatly reduces
inverted-index storage costs.
More Examples: Zipf’s Law
 Illustration of the rank-frequency law. Let the total number of word
occurrences in the sample be N = 1,000,000

Rank (R)   Term   Frequency (F)   R·(F/N)
   1       the        69,971       0.070
   2       of         36,411       0.073
   3       and        28,852       0.086
   4       to         26,149       0.104
   5       a          23,237       0.116
   6       in         21,341       0.128
   7       that       10,595       0.074
   8       is         10,099       0.081
   9       was         9,816       0.088
  10       he          9,543       0.095
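The rank-frequency law above can be checked directly. The following sketch (values copied from the table, N = 1,000,000) computes R·(F/N) for each rank and shows that the product stays roughly constant:

```python
# Verify Zipf's rank-frequency law: r * (f / N) stays roughly constant.
# Frequencies are the sample values from the table above.
freqs = {
    1: 69_971, 2: 36_411, 3: 28_852, 4: 26_149, 5: 23_237,
    6: 21_341, 7: 10_595, 8: 10_099, 9: 9_816, 10: 9_543,
}
N = 1_000_000  # total word occurrences in the sample

for rank, f in freqs.items():
    c = rank * f / N
    print(f"rank {rank:2d}: f = {f:6d}, r*(f/N) = {c:.3f}")
```

All products fall in a narrow band (roughly 0.07–0.13), which is the empirical content of Zipf's law.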
Methods that Build on Zipf's Law

• Stop lists: Ignore the most frequent words


(upper cut-off). Used by almost all systems.
• Significant words: Take words in between the
most frequent (upper cut-off) and least frequent
words (lower cut-off).
• Term weighting: Give differing weights to
terms based on their frequency, with the most
frequent words weighted less.
• Used by almost all ranking methods.
Word significance: Luhn’s Ideas
• Luhn Idea (1958): the frequency of word occurrence in a text

furnishes a useful measurement of word significance.

• Luhn suggested that both extremely common and extremely

uncommon words were not very useful for indexing.

• For this, Luhn specifies two cutoff points: an upper and a lower

cutoffs based on which non-significant words are excluded


– The words exceeding/above the upper cutoff were considered to be

common

– The words below the lower cutoff were considered to be rare

– Hence they are not contributing significantly to the content of the

text

– The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cutoffs

Vocabulary size: Heaps’ Law
Heaps’ law estimates the number of distinct words
(the vocabulary size) in a given corpus

How does the size of the overall vocabulary

grow with the size of the corpus?


This determines how the size of the inverted
index will scale with the size of the corpus.


Heaps’ distribution
• Distribution of the size of the vocabulary: vocabulary size V
grows with the number of tokens n as V = K · n^β (constants K and β,
with 0 < β < 1), i.e., a linear relationship on a log-log plot

Example: from 1,000,000,000 documents, there


may be 1,000,000 distinct words. Can you agree?
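A minimal sketch of Heaps' law makes the sub-linear growth concrete. The constants below (K = 44, β = 0.49) are illustrative only; real values are corpus-dependent and must be fitted:

```python
# Heaps' law sketch: V(n) = K * n**beta estimates vocabulary size V
# from the number of tokens n. K and beta vary by corpus; the values
# here (K = 44, beta = 0.49) are typical ballpark figures for English
# text, used purely for illustration.
def heaps_vocabulary(n_tokens, K=44, beta=0.49):
    return int(K * n_tokens ** beta)

v_small = heaps_vocabulary(1_000_000)      # vocabulary of a 1M-token corpus
v_large = heaps_vocabulary(100_000_000)    # 100x more tokens...
print(v_small, v_large)                    # ...but far less than 100x vocabulary
```

The point of the sketch: multiplying the corpus size by 100 grows the vocabulary by roughly a factor of 10, not 100, which is why inverted-index size stays manageable.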
Text Operations
 Not all words in a document are equally significant to represent the

contents/meanings of a document

 Some words carry more meaning than others

 Nouns are the most representative of a document’s content

 Therefore, we need to preprocess the text of the documents in a collection

to obtain the terms to be used as index terms

 Using the set of all words in a collection to index documents creates

too much noise for the retrieval task

 Reducing noise means reducing the number of words that can be used to refer to the document
Preprocessing is the process of controlling the size of the

vocabulary or the number of distinct words used as index

terms

 Preprocessing will lead to an improvement in the information

retrieval performance

However, some search engines on the Web omit

preprocessing

 Every word in the document is an index term


Document snippet
 “One evening Frodo and Sam were walking together in the cool

twilight. Both of them felt restless again. On Frodo suddenly the


shadow of parting had falling: the time to leave Lothlorien was near. ”
  Unstructured (bag-of-words) representation

 {(One, 1), (evening, 1), (Frodo, 2), (and, 2), (Sam, 1) (were, 1),

(walking, 1), (together, 1), (in, 1), (the, 3), (cool, 1), (twilight, 1),
(Both, 1), (of, 2), (them, 1), (felt,1), (restless, 1), (again, 1), (On, 1),
(suddenly, 1), (shadow, 1), (parting, 1), (had, 1), (falling, 1), (time, 1),
(to, 1), (leave, 1), (Lothlorien, 1), (was, 1), (near, 1)}
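The bag-of-words representation above can be produced with a few lines of code. This is a minimal sketch using whitespace splitting and punctuation stripping; exact counts can differ slightly from the list above depending on how punctuation and case are handled:

```python
from collections import Counter
import string

# Bag-of-words sketch: split the snippet on whitespace, strip leading/
# trailing punctuation from each token, and count occurrences.
text = ("One evening Frodo and Sam were walking together in the cool "
        "twilight. Both of them felt restless again. On Frodo suddenly the "
        "shadow of parting had falling: the time to leave Lothlorien was near.")

tokens = [w.strip(string.punctuation) for w in text.split()]
bag = Counter(t for t in tokens if t)

print(bag["the"])    # 3
print(bag["Frodo"])  # 2
```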
Weakly-structured representations
Bag of nouns

{(evening, 1), (Frodo, 2), (Sam, 1), (twilight, 1),

(shadow, 1), (parting, 1), (time, 1), (Lothlorien,


1)}
 Bag of named entities

{(Frodo, 2), (Sam, 1), (Lothlorien, 1)}


Text Operations
• Text operations are the process of transforming text into
logical representations
• 4 main operations for selecting index terms, i.e. to choose
words/stems (or groups of words) to be used as indexing
terms:
– Lexical analysis/tokenization of the text – handling digits, hyphens,
punctuation marks, and the case of letters
– Elimination of stop words - filter out words which are not useful
in the retrieval process
– Stemming words - remove affixes (prefixes and suffixes)
– Construction of term categorization structures such as thesaurus,
to capture relationship for allowing the expansion of the original
query with related terms
Generating Document Representatives
• Text Processing System
– Input text – full text, abstract or title
– Output – a document representative adequate for use in an
automatic retrieval system

• A document will be indexed by a name if one of its significant


words occurs as a member of that class.

documents → Tokenization → stop-word removal → stemming → Thesaurus → index terms
Lexical Analysis/Tokenization of
Text
Change text of the documents into words to be adopted as
index terms
Objective - identify words in the text ( Digits, hyphens,

punctuation marks, case of letters)


 Numbers are not good index terms (like 1910, 1999)
 Hyphens – break up the words (e.g. state-of-the-art → state of the art),
but some words, e.g. gilt-edged, B-49, are unique words which require
hyphens
 Punctuation marks – remove totally unless significant, e.g. in program
code: x.exe vs. xexe
 Case of letters – not important; can convert all to upper or lower case
Word is a delimited string of characters as it appears in
the text
 Term is a normalized form of the word (accounting for
morphology, spelling, etc.)
Word and term are in the same equivalence class – in
informal speech they are often used interchangeably
 Token is an instance of a word or term occurring in a
document
 Tokens are “words” in the general sense
 But numbers, punctuation, and special characters are also
tokens
Tokenization is a process, typically automated, of
breaking down the text (one long string) into a sequence
of tokens (shorter strings)
Tokenization
Analyze text into a sequence of discrete tokens (words).
Input: “Friends, Romans and Countrymen”
Output: Tokens (an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing)
Friends
Romans
and
Countrymen
Each such token is now a candidate for an index entry,
after further processing
 Simplest approach is to ignore all numbers and
punctuation and use only case-insensitive unbroken
strings of alphabetic characters as tokens.
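The “simplest approach” above is easy to implement. This sketch keeps only unbroken alphabetic strings and lowercases them, discarding numbers and punctuation (so, e.g., “It’s” becomes two tokens):

```python
import re

# Minimal tokenizer for the simplest approach described above:
# case-insensitive unbroken strings of alphabetic characters only.
def tokenize(text):
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']
```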
Exercise: Tokenization
The cat slept peacefully in the living room. It’s a very old
cat.

Mr. O’Neill thinks that the boys’ stories about Chile’s


capital aren’t amusing.
Exercise on Issues in Tokenization
• One word or multiple: How do you decide it is one token or two

or more?
– Hewlett-Packard → Hewlett and Packard as two tokens?
• state-of-the-art: break up hyphenated sequence.

• San Francisco, Los Angeles

– lowercase, lower-case, lower case ?


• data base, database, data-base

• Numbers:
• dates (3/12/91 vs. Mar. 12, 1991);

• phone numbers,

• IP addresses (100.2.86.144)
 Stop words are semantically poor terms such as articles, prepositions,
conjunctions, pronouns, etc.

 Removal of stop words is one of the most common steps of IR text preprocessing

 Q: Why would we want to remove the stop words?

 A: Because stop words have very little meaning, they do not determine

whether a document is relevant or not

 A: Removing stop words reduces the size of vocabulary (and index) and

makes retrieval process more efficient

 A: Including stop words may lead to false positives because of stop word

matches between query and documents


Retrieval of documents may result in:
False negative (false drop): some relevant documents may not be
retrieved.
False positive: some irrelevant documents may be retrieved.
For many applications a good index should not permit any false
drops, but may permit a few false positives.

                   relevant                    irrelevant
retrieved          A                           B – “false positives”
                                                   (“Type I errors”, “errors of commission”)
not retrieved      C – “false negatives”       D
                       (“Type II errors”, “errors of omission”)

• These counts underlie the metrics often used to evaluate the effectiveness of the system
Elimination of STOPWORD
• Stopwords are extremely common words across document
collections that have no discriminatory power
– They may occur in 80% of the documents in a collection.
– They appear to be of little value in helping select documents
matching a user need and need to be filtered out as potential index
terms
• Examples of stop words are articles, prepositions, conjunctions,
etc.:
– articles (a, an, the); pronouns: (I, he, she, it, their, his)
– Some prepositions (on, of, in, about, besides, against),
– conjunctions/ connectors (and, but, for, nor, or, so, yet),
– verbs (is, are, was, were),
– adverbs (here, there, out, because, soon, after) and
– adjectives (all, any, each, every, few, many, some) can also be treated
as stopwords
How to determine a list of stop words?
• One method: Sort terms (in decreasing order) by

collection frequency and take the most frequent ones

• Another method: Build a stop word list that contains

a set of articles, pronouns, etc.

– Why do we need stop lists: to exclude the commonest words

entirely from the index terms
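The first method above (sort by collection frequency, take the top terms) can be sketched in a few lines. The tiny collection below is illustrative only:

```python
from collections import Counter

# Sketch of the frequency-based method: count terms over the whole
# collection and take the k most frequent as the stop list.
def build_stop_list(documents, k=3):
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(k)}

docs = ["the cat sat on the mat",
        "the dog and the cat",
        "a dog on a mat"]
print(build_stop_list(docs))  # "the" is the most frequent term
```

In practice the cut-off k (or a frequency threshold) must be tuned; a fixed word list (articles, pronouns, etc.) is the alternative second method.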


Stop words
• Stop word elimination used to be standard in older IR

systems.

• But the trend is getting away from doing this.

• Most web search engines index stop words:

– Good query optimization techniques mean you pay little at query

time for including stop words.

– Elimination of stop words might reduce recall


Normalization issues
• Normalization is the process of putting the
incoming items/tokens to a standard format
• Case-folding – converting all letters to lower case

• Need to “normalize” terms in indexed text as well as query terms into

the same form

Often best to lower case everything (queries and


documents)
• In IR, lowercasing is most practical because of the way

users issue their queries


Normalization
• Case Folding: Often best to lower case everything, since users will use

lowercase regardless of ‘correct’ capitalization. Example: We want to


match
– Republican vs. republican

– Fasil vs. fasil vs. FASIL

– Anti-discriminatory vs. antidiscriminatory

– Car vs. automobile?

– Example: We want to match U.S.A. and USA,

How does Google do it?

 “C.A.T.” (information need: Caterpillar Inc.)

 returns cat (the animal) as the first result


Stemming/Morphological analysis
• Stems : the base of a word, to which affixes are added.

 Reducing different forms of the same word to a common

representative form
• Stemming reduces tokens to their “root” form of words to recognize

morphological variation .
– The process involves removal of affixes (i.e. prefixes and suffixes)

with the aim of reducing variation to the same stem


– e.g.: rewrite, postgraduate, unreliable, carefully, blindness

• Stemming is language dependent


 Stemming is the procedure of reducing the word to its grammatical root

 The result of stemming is not necessarily a valid word of the language

 E.g., “recognized” -> “recogniz”, “incredibly” -> “incredibl”

 Stemming removes suffixes with heuristics

 E.g., “automates”, “automatic”, “automation” will all be reduced to
“automat”

 Normalization reduces all words syntactically derived from some word to
the original word (even if the derived word has a different meaning)
 E.g., “destruction” -> “destroy”,
“houses” -> “house”, “tried” -> “try”


Stemming
• The final output from a conflation (reducing to the same token)

algorithm is a set of classes, one for each stem detected.

–A Stem: the portion of a word which is left after the removal of its

affixes (i.e., prefixes and/or suffixes).

–Example: ‘connect’ is the stem for {connected, connecting,
connection, connections}

–Thus, [automate, automatic, automation] all reduce to →
automat
A class name is assigned to a document if and only if one of its

members occurs as a significant word in the text of the document.


– A document representative then becomes a list of class names,

which are often referred as the documents index terms/keywords.

There are basically two ways to implement stemming.

1. The first approach is to create a big dictionary that maps words to

their stems.
2. The second approach is to use a set of rules (an algorithm) that
extracts stems from words


Porter Stemmer
• Stemming is the operation of stripping the suffixes from a
word, leaving its stem.

– Google, for instance, uses stemming to search for web pages containing

the words connected, connecting, connection and connections when

users ask for a web page that contains the word connect.

• In 1979, Martin Porter developed a stemming algorithm that
uses a set of rules to extract stems from words.


Porter stemmer
• Most common algorithm for stemming English words to their
common grammatical root
• It is simple procedure for removing known affixes in English
without using a dictionary.
• To get rid of plurals and common suffixes, rules such as the
following are used:
– SSES → SS      caresses → caress
– IES → I        ponies → poni
– SS → SS        caress → caress
– S → (null)     cats → cat
– EMENT → (null) (delete final “ement” if what remains is longer than 1
character)
                 replacement → replac
While step 1a gets rid of plurals, step 1b removes -ed or
-ing, e.g.:
agreed -> agree        disabled -> disable
matting -> mat         mating -> mate
meeting -> meet        milling -> mill
messing -> mess        meetings -> meet
feed -> feed
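The step 1a rules listed above translate almost directly into code. This is a sketch of step 1a only (plural removal), not the full Porter algorithm, which has many more steps and conditions:

```python
# Sketch of Porter step 1a (plural removal) following the SSES/IES/SS/S
# rules listed above. The full algorithm adds measure conditions and
# further steps (1b, 1c, 2-5) not shown here.
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS:  caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # IES -> I:    ponies -> poni
    if word.endswith("ss"):
        return word        # SS -> SS:    caress -> caress
    if word.endswith("s"):
        return word[:-1]   # S -> null:   cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that rule order matters: the longer suffixes must be tested first, otherwise “caresses” would match the bare `s` rule.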
Stemming: challenges
May produce unusual stems that are not English words:

Removing ‘UAL’ from FACTUAL and EQUAL

May conflate (reduce to the same token) words that are
actually distinct

 “computer”, “computational”, “computation” all
reduced to same token “comput”
Thesauri
• It lists words in groups of synonyms and related
concepts.
• Full-text searching alone often cannot be accurate, since
different authors may select different words to
represent the same concept
– Problem: The same meaning can be expressed using
different terms that are synonyms, homonyms, and related
terms
– How can it be achieved such that for the same meaning the
identical terms are used in the index and the query?
Homonyms: each of two or more words having the same
spelling or pronunciation but different meanings and origins.
E.g.: right (direction) vs. right (correct); band (music group) vs. band (strip).
Aim of Thesaurus
• Thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms
• The aim of thesaurus is therefore:
– to provide a standard vocabulary for indexing and searching
• A thesaurus rewrites terms to form equivalence classes, and we index
such equivalence classes
• When the document contains automobile, index it under car as well
– to assist users with locating terms for proper query
formulation: When the query contains automobile, look
under car as well for expanding query
– to provide classified hierarchies that allow the broadening
and narrowing of the current request according to user needs

e.g.: car = automobile, truck, bus, taxi, motor vehicle
      color = colour, paint
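Query expansion with such equivalence classes can be sketched in a few lines. The tiny thesaurus below uses the classes from the example above and is illustrative only:

```python
# Thesaurus-based query expansion sketch: each query term is replaced
# by its whole equivalence class, so a query for "automobile" also
# matches documents indexed under "car". Toy thesaurus for illustration.
THESAURUS = {
    "car": {"car", "automobile", "truck", "bus", "taxi", "motor vehicle"},
    "color": {"color", "colour", "paint"},
}

def expand_query(terms):
    expanded = set()
    for t in terms:
        for synonyms in THESAURUS.values():
            if t in synonyms:
                expanded |= synonyms   # add the whole equivalence class
                break
        else:
            expanded.add(t)            # no class found: keep the term as-is
    return expanded

print(expand_query(["automobile"]))
```

The same table can be used at indexing time instead (index “automobile” under “car”); either way, query and index end up using identical terms for the same meaning.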
Questions
Which search engines use text preprocessing?

Why, or why not? Does text preprocessing have an

impact on users’ needs, the IR system, etc.? If so,

how? If not, why not?

Submission date: 09/04/14 E.C.


Questions?
