Text Operations
Statistical Properties of Text
How is the frequency of different words distributed?
How fast does vocabulary size grow with the size of a corpus?
The 2 most frequent words (e.g. “the”, “of”) can account for about 10%
of word occurrences.
About half the words in a corpus appear only once; such words are
called hapax legomena (Greek for “read only once”).
Sample Word Frequency Data
The table shows the most frequently occurring words from a collection
of 336,310 documents containing 125,720,891 total word occurrences, of
which 508,209 are unique words.
Zipf’s Distribution
Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
• f : the frequency with which w appears
• r : the rank of w in order of frequency
  (the most commonly occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies according to Zipf’s
law; frequency f on the vertical axis, rank r on the horizontal axis]
Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
attempts to capture the distribution of the frequencies (i.e.,
numbers of occurrences) of the words within a text.
Zipf's Law states that when the distinct words in a text are
arranged in decreasing order of their frequency of
occurrence (most frequent words first), the occurrence
characteristics of the vocabulary can be characterized by
the constant rank-frequency law of Zipf:
frequency × rank ≈ constant, i.e. f · r ≈ C
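The rank-frequency law above can be checked empirically on any small corpus. A minimal sketch in Python (the toy text below is illustrative, not from the slides):

```python
from collections import Counter

# Count word occurrences in a toy corpus and print the rank-frequency
# product r * f for each word. Under Zipf's law, r * f should be
# roughly constant across ranks.
text = (
    "the cat sat on the mat the dog sat on the log "
    "the cat and the dog sat together on the mat"
)
counts = Counter(text.split())

# Sort words by decreasing frequency; rank 1 is the most frequent word.
for rank, (word, freq) in enumerate(
        sorted(counts.items(), key=lambda kv: -kv[1]), start=1):
    print(f"{rank:>4}  {word:<10} f={freq:<3} r*f={rank * freq}")
```

On a corpus this small the fit is rough; the law becomes visible only on large collections.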
• Zipf’s Law Impact on IR
Stop words will account for a large fraction of
the text, so eliminating them greatly reduces
inverted-index storage costs.
Another Example: Zipf’s Law
Illustration of the Rank-Frequency Law. Let the total number of word
occurrences in the sample be N = 1,000,000.

Rank (R) | Term | Frequency (F) | R·(F/N)
• For this, Luhn specifies two cutoff points: an upper and a lower
cutoff. Words whose frequency lies above the upper cutoff are too
common to discriminate content, and words below the lower cutoff are
too rare; the mid-frequency words in a given corpus are the best
indicators of the contents/meanings of a document.
• Reducing noise means removing words which cannot reliably be used to
refer to the contents of a document.
• Preprocessing is the process of controlling the size of the
vocabulary, i.e. the number of distinct index terms; done well, it
reduces noise and improves retrieval performance.
{(One, 1), (evening, 1), (Frodo, 2), (and, 2), (Sam, 1), (were, 1),
(walking, 1), (together, 1), (in, 1), (the, 3), (cool, 1), (twilight, 1),
(Both, 1), (of, 2), (them, 1), (felt,1), (restless, 1), (again, 1), (On, 1),
(suddenly, 1), (shadow, 1), (parting, 1), (had, 1), (falling, 1), (time, 1),
(to, 1), (leave, 1), (Lothlorien, 1), (was, 1), (near, 1)}
Weakly-structured representations
• Bag of nouns
• Index terms
Lexical Analysis / Tokenization of Text
Change the text of the documents into words to be adopted as
index terms.
Objective: identify the words in the text. Issues include digits,
hyphens, and punctuation:
– Digits are often removed, but some words, e.g. gilt-edged, B-49, are
unique terms which require them.
– Punctuation marks: remove totally unless significant, e.g. in
program text.
– Hyphenation: one token or more?
  • Hewlett-Packard: Hewlett and Packard as two tokens?
  • state-of-the-art: break up the hyphenated sequence.
• Numbers:
• dates (3/12/91 vs. Mar. 12, 1991);
• phone numbers,
• IP addresses (100.2.86.144)
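The tokenization choices above can be sketched with a single regular expression that keeps hyphenated words and dotted numbers (such as IP addresses) as single tokens while dropping other punctuation. This is an illustrative toy, not a production tokenizer:

```python
import re

# One alphanumeric run, optionally joined to further runs by a single
# hyphen or dot: matches "Hewlett-Packard", "B-49", "100.2.86.144",
# but strips trailing punctuation.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Hewlett-Packard shipped B-49 units to 100.2.86.144."))
```

A different policy (e.g. splitting state-of-the-art into four tokens) would need only a change to the pattern.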
Stop words are semantically poor terms such as articles, prepositions,
and conjunctions.
Removal of stop words is one of the most common steps of IR text preprocessing.
A: Because stop words have very little meaning, they do not determine
the relevance of a document to a query.
A: Removing stop words reduces the size of the vocabulary (and index) and
speeds up processing.
A: Including stop words may lead to false positives because of
stop-word matches between query and document.
                relevant    irrelevant
retrieved          A            B
not retrieved      C            D

– B: “type one errors”, “errors of commission”, “false positives”
– C: “type two errors”, “errors of omission”, “false negatives”
• Metrics computed from these counts are often used to
evaluate the effectiveness of the system.
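The standard effectiveness metrics follow directly from the counts A, B, C, D (A = relevant and retrieved, B = irrelevant but retrieved, C = relevant but not retrieved). A minimal sketch with assumed, illustrative counts:

```python
# Precision: fraction of retrieved documents that are relevant.
def precision(a, b):
    return a / (a + b)

# Recall: fraction of relevant documents that were retrieved.
def recall(a, c):
    return a / (a + c)

# Illustrative counts (assumed, not from the slides).
A, B, C, D = 30, 10, 20, 940
print(f"precision = {precision(A, B):.2f}")
print(f"recall    = {recall(A, C):.2f}")
```

False positives (B) lower precision; false negatives (C) lower recall.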
Elimination of Stopwords
• Stopwords are extremely common words across document
collections that have no discriminatory power
– They may occur in 80% of the documents in a collection.
– They appear to be of little value in helping select documents
matching a user need and need to be filtered out as potential index
terms
• Examples of stop words are articles, prepositions, conjunctions,
etc.:
– articles (a, an, the); pronouns: (I, he, she, it, their, his)
– Some prepositions (on, of, in, about, besides, against),
– conjunctions/ connectors (and, but, for, nor, or, so, yet),
– verbs (is, are, was, were),
– adverbs (here, there, out, because, soon, after) and
– adjectives (all, any, each, every, few, many, some) can also be treated
as stopwords
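Stop-word elimination itself is a simple filter over the token stream. A minimal sketch using a small hand-picked list (real systems use much longer lists, e.g. the one shipped with NLTK):

```python
# A tiny illustrative stop list drawn from the categories above
# (articles, prepositions, conjunctions, verbs, pronouns).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "but",
             "is", "are", "was", "were", "i", "he", "she", "it"}

def remove_stopwords(tokens):
    # Compare case-insensitively so "The" is removed along with "the".
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("The cat sat on the mat".split()))
```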
How to determine a list of stop words?
• One method: sort terms by collection frequency in decreasing order
and take the most frequent ones as stop words; stop lists are often
hand-tuned for particular collections and systems.
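The frequency-based method can be sketched directly with a counter; in practice the resulting list would then be hand-filtered:

```python
from collections import Counter

# Take the k most frequent terms in the collection as candidate
# stop words (case-folded).
def top_k_stopwords(tokens, k):
    counts = Counter(t.lower() for t in tokens)
    return [term for term, _ in counts.most_common(k)]

corpus = ("the cat sat on the mat and the dog sat on the log "
          "and the bird sat on the wire").split()
print(top_k_stopwords(corpus, 3))
```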
Stemming: reduce tokens to a representative form.
• Stemming reduces tokens to their “root” form in order to recognize
morphological variation.
– The process involves removal of affixes (i.e. prefixes and suffixes),
mapping a derived word back to the original word (even if the derived
word has a different meaning), e.g. “destruction” -> “destroy”.
– A stem is the portion of a word which is left after the removal of
its affixes, e.g. {connected, connecting, connection, connections} ->
“connect”; {automate, automatic, automation} -> “automat”.
Two approaches are common:
1. The first approach uses a dictionary that maps word variants to
their stems: a stem class name is assigned to a word if and only if the
word matches one of the variants stored under that stem.
2. The second approach is to use a set of rules (an algorithm) that
strips affixes from words, e.g. the Porter stemmer.
– Google, for instance, uses stemming to search for web pages containing
the words connected, connecting, connection, and connections when
users ask for a web page that contains the word connect.
Why not?
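A toy rule-based stemmer in the spirit of the second approach can be written as a handful of suffix-stripping rules. Real stemmers such as Porter's use many more rules and conditions; this sketch handles only a few suffixes:

```python
# Longer suffixes must be tried first ("ions" before "ion",
# "ings" before "ing"), otherwise the shorter rule would fire early.
SUFFIXES = ("ions", "ion", "ings", "ing", "ies", "ed", "s")

def stem(word):
    w = word.lower()
    for suf in SUFFIXES:
        # Require at least 3 letters to remain, so short words survive.
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[: -len(suf)]
    return w

print([stem(w) for w in ["connected", "connecting",
                         "connection", "connections"]])
```

All four variants map to the single stem "connect", which is exactly the conflation the slides describe.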