TEXT OPERATIONS

By Abdo Ababor

2021
Text operations
5 main operations for selecting index terms:
• Lexical analysis/tokenization of the text - handling digits, hyphens, punctuation marks, and the case of letters
• Elimination of stop words - filter out words which are not useful in the retrieval process
• Stemming words - remove affixes (prefixes and suffixes)
• Construction of term categorization structures, such as a thesaurus, to capture relationships among terms and allow expansion of the original query with related terms
• Term weighting - taking into account term and document frequency
Statistical Properties of Text
How is the frequency of different words distributed?
How fast does vocabulary size grow with the size of a corpus?
 Such factors affect the performance of IR system & can be used to
select suitable term weights & other aspects of the system.
A few words are very common.
 2 most frequent words (e.g. “the”, “of”) can account for
about 10% of word occurrences.
Most words are very rare.
 Roughly half of the words in a corpus appear only once ("read only once" words)
 This is called a "heavy-tailed" distribution, since most of the probability mass is in the "tail"
Sample Word Frequency Data
[sample word frequency table not reproduced in this text version]
Zipf’s distributions
Rank-Frequency Distribution
For all the words in a collection of documents, for each word w
• f : the frequency with which w appears
• r : the rank of w in decreasing order of frequency (the most commonly occurring word has rank 1, etc.)
[figure: distribution of sorted word frequencies f versus rank r according to Zipf's law - the most frequent words sit at the lowest ranks, and the "read only once" words form the long tail]
Word distribution: Zipf's Law
Zipf's Law - named after the Harvard linguistics professor George Kingsley Zipf (1902-1950) -
 attempts to capture the distribution of the frequencies (i.e., number of occurrences) of the words within a text.
 Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf:
Frequency * Rank = constant
 The words roughly fit the relation: f * r = c

 Different collections have different constants c


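To make the rank-frequency relation concrete, the short sketch below (not part of the original slides) counts word frequencies in a text, ranks them, and prints f * r for the top words; on a reasonably large corpus the products come out roughly constant. The tokenization used (lowercasing, keeping alphabetic runs only) is a simplifying assumption.

import re
from collections import Counter

def zipf_table(text, top_n=10):
    """Count word frequencies, rank them, and report f * r for the top words."""
    # naive tokenization: lowercase, keep alphabetic runs only (an assumption)
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    ranked = counts.most_common(top_n)          # [(word, f), ...] in rank order
    for rank, (word, freq) in enumerate(ranked, start=1):
        print(f"{rank:>4}  {word:<12} f={freq:<8} f*r={freq * rank}")

# usage: zipf_table(open("corpus.txt", encoding="utf-8").read())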
Example: Zipf's Law

 The table (not reproduced here) shows the most frequently occurring words from a 336,310-document collection containing 125,720,891 total words, of which 508,209 are unique words
More Example: Zipf’s Law
 Illustration of the Rank-Frequency Law. Let the total number of word occurrences in the sample be N = 1,000,000

 Rank (R)   Term   Frequency (F)   R*(F/N)
    1       the        69,971       0.070
    2       of         36,411       0.073
    3       and        28,852       0.086
    4       to         26,149       0.104
    5       a          23,237       0.116
    6       in         21,341       0.128
    7       that       10,595       0.074
    8       is         10,099       0.081
    9       was         9,816       0.088
   10       he          9,543       0.095
Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-off).
• Used by almost all systems.
• Significant words: Take words in between the most frequent
(upper cut-off) and least frequent words (lower cut-off).
• Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less.
• Used by almost all ranking methods.
Explanations for Zipf’s Law
 The law has been explained by the "principle of least effort", which makes it easier for a speaker or writer of a language to repeat certain words instead of coining new and different words.
 "Principle of least effort" - a balance between the speaker's desire for a small vocabulary and the hearer's desire for a large one.
 Zipf's Law impact on IR
 Good news: Stopwords account for a large fraction of the text, so eliminating them greatly reduces inverted-index storage costs.
 Bad news: For most words, gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult, since they are extremely rare.
Word significance: Luhn’s Ideas
 Luhn Idea (1958): the frequency of word occurrence in a text gives a
useful measurement of word significance.
 Luhn suggested that both extremely common and extremely uncommon words are not very useful for indexing.
 For this, Luhn specified two cutoff points: an upper and a lower cutoff, based on which non-significant words are excluded
 Words above the upper cutoff are considered to be too common
 Words below the lower cutoff are considered to be too rare
 Neither group contributes significantly to the content of the text
 The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cutoffs
Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of word frequency; a plot relating f and r yields the curve described below.
[figure: rank-frequency curve with Luhn's upper and lower cutoffs; significant words lie between the two cutoffs]
 Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for document representation and indexing.
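A minimal sketch of Luhn's idea in code (an illustration, not from the slides): given the tokens of a collection, discard words above an upper frequency cutoff and below a lower one, and keep the mid-frequency "significant" words. The two cutoff values used here are arbitrary placeholders; in practice they are tuned per collection.

from collections import Counter

def significant_words(tokens, upper_cutoff=0.01, lower_cutoff=3):
    """Keep words between Luhn's two cutoffs.

    upper_cutoff: discard words whose relative frequency exceeds this (too common).
    lower_cutoff: discard words occurring fewer than this many times (too rare).
    Both thresholds are illustrative assumptions, not values from the slides.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w for w, f in counts.items()
            if f >= lower_cutoff and f / total <= upper_cutoff}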
Vocabulary size: Heaps' Law
 Dictionaries
 600,000+ words
 But they do not include names of people, locations, products, etc.
 Heaps' law estimates the vocabulary size (number of unique words) of a given corpus
 How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
 This determines how the size of the inverted index will scale with the size of the corpus.
Vocabulary Growth: Heaps' Law
 Heaps' law estimates the vocabulary size of a given corpus
 The vocabulary size grows as O(n^β), where β is a constant between 0 and 1.
 If V is the size of the vocabulary and n is the length of the corpus in words, Heaps' law provides the following equation:

   V = K * n^β

 where the constants typically are:
   K ≈ 10-100
   β ≈ 0.4-0.6 (approximately a square root)
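As a quick illustrative check using mid-range values from the ranges above (an assumption, not measured constants): with K = 50 and β = 0.5, a corpus of n = 1,000,000 tokens gives V = 50 * 1,000,000^0.5 = 50 * 1,000 = 50,000 distinct words.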
Heaps' distributions
• Distribution of the size of the vocabulary: vocabulary size and number of tokens follow a power-law relationship, which appears as an approximately straight line when both are plotted on a logarithmic scale
 Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Can you agree?
Example
 We want to estimate the size of the vocabulary for a corpus of
1,000,000 words. However, we only know statistics computed
on smaller corpora:
 For 100,000 words, there are 50,000 unique words

 For 500,000 words, there are 150,000 unique words

 Estimate the vocabulary size for the 1,000,000-word corpus.
 How about for a corpus of 1,000,000,000 words?
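One way to answer this, sketched below (not part of the slides), is to assume the Heaps relationship V = K * n^β holds exactly, solve for β and K from the two known (corpus size, vocabulary) points, and extrapolate. With the numbers above this gives β ≈ 0.68 and K ≈ 19, i.e. roughly 240,000 unique words for the 1,000,000-word corpus and on the order of 27 million for 1,000,000,000 words.

import math

def fit_heaps(n1, v1, n2, v2):
    """Solve V = K * n**beta exactly through two (corpus size, vocabulary) points."""
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    k = v1 / n1 ** beta
    return k, beta

k, beta = fit_heaps(100_000, 50_000, 500_000, 150_000)
for n in (1_000_000, 1_000_000_000):
    print(f"n={n:>13,}  estimated vocabulary ≈ {k * n ** beta:,.0f}")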

Text Operations
 Not all words in a document are equally significant to represent the
contents/meanings of a document
 Some words carry more meaning than others
 Noun words are the most representative of a document's content
 Therefore, we need to preprocess the text of the documents in a collection to select the terms to be used as index terms
 Using the set of all words in a collection to index documents creates too much noise for the retrieval task
 Reducing noise means reducing the number of words that can be used to refer to the document
 Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms
 Preprocessing will lead to an improvement in information retrieval performance
 However, some search engines on the Web omit preprocessing
 Every word in the document is then an index term
Text Operations
Text operations are the process of transforming text into logical representations
 5 main operations for selecting index terms, i.e. choosing words/stems (or groups of words) to be used as indexing terms:
 Lexical analysis/tokenization of the text - handling digits, hyphens, punctuation marks, and the case of letters
 Elimination of stop words - filter out words which are not useful in the retrieval process
 Stemming words - remove affixes (prefixes and suffixes)
 Construction of term categorization structures, such as a thesaurus, to capture relationships among terms and allow expansion of the original query with related terms
 Term weighting - taking into account term and document frequency
Generating Document Representatives
 Text Processing System
 Input text – full text, abstract or title
 Output – a document representative adequate for use in an
automatic retrieval system
 The document representative consists of a list of class names,
each name representing a class of words occurring in the total
input text. A document will be indexed by a name if one of its
significant words occurs as a member of that class.

documents → tokenization → stop-word removal → stemming → thesaurus → index terms
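The pipeline above can be sketched as a chain of small functions. This is an illustrative skeleton only: the stop list and suffix rules are tiny stand-ins for real components (a full stop list, the Porter stemmer), and the thesaurus step is omitted.

import re

STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "was", "were"}   # tiny illustrative list

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # crude suffix stripping as a placeholder for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def document_representative(text):
    """documents -> tokenization -> stop-word removal -> stemming -> index terms"""
    return sorted({stem(t) for t in remove_stopwords(tokenize(text))})

# usage: document_representative("The connected components were connecting quickly")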
Lexical Analysis/Tokenization of Text
Change text of the documents into words to be adopted as
index terms
Objective - identify words in the text
 Digits, hyphens, punctuation marks, case of letters
 Numbers are usually not good index terms (like 1910, 1999); but some, such as 510 B.C., are unique and worth keeping
 Hyphens - break up hyphenated words (e.g. state-of-the-art → state of the art); but some words, e.g. gilt-edged, B-49, are unique words which require their hyphens
 Punctuation marks - remove entirely unless significant, e.g. in program code: x.exe vs. xexe
 Case of letters - usually not important; convert all letters to upper or lower case
Tokenization
 Analyze text into a sequence of discrete tokens (words).
 Input: “Friends, Romans and Countrymen”
 Output: Tokens (an instance of a sequence of characters that are grouped
together as a useful semantic unit for processing)
 Friends
 Romans

 and

 Countrymen

 Each such token is now a candidate for an index entry, after


further processing
 But what are valid tokens to use?
Issues in Tokenization
 One word or multiple: How do you decide it is one token or
two or more?
 Hewlett-Packard  Hewlett and Packard as two tokens?
 state-of-the-art: break up hyphenated sequence.
 San Francisco, Los Angeles
 lowercase, lower-case, lower case ?
 data base, database, data-base
 Numbers:
 dates (3/12/91 vs. Mar. 12, 1991);
 phone numbers,
 IP addresses (100.2.86.144)
Issues in Tokenization
How to handle special cases involving apostrophes, hyphens
etc? C++, C#, URLs, e-mails, …
 Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a token.
 However, frequently they are not.
 Simplest approach is to ignore all numbers and punctuation and use
only case-insensitive unbroken strings of alphabetic characters as
tokens.
 Generally, systems do not index numbers as text, but numbers are often very useful for search. Systems will often index "meta-data" separately, including creation date, format, etc.
 Issues of tokenization are language specific
 They require the language to be known
 What works for one language does not necessarily work for another.
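A minimal sketch of the "simplest approach" mentioned above: ignore numbers and punctuation and keep only case-insensitive unbroken strings of alphabetic characters as tokens. Real tokenizers are considerably more careful with hyphens, apostrophes, and language-specific rules.

import re

def simple_tokenize(text):
    """Case-insensitive alphabetic tokens only; numbers and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("Hewlett-Packard's X100 shipped on 3/12/91, state-of-the-art."))
# -> ['hewlett', 'packard', 's', 'x', 'shipped', 'on', 'state', 'of', 'the', 'art']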
Elimination of STOPWORDS
Stopwords are extremely common words across document
collections that have no discriminatory power
 They may occur in 80% of the documents in a collection.
 They would appear to be of little value in helping select documents matching a user need and need to be filtered out as potential index terms
 Elimination of STOPWORDS
 Examples of stopwords are articles, prepositions, conjunctions, etc.:
 articles (a, an, the);
 pronouns: (I, he, she, it, their, his)
 Some prepositions (on, of, in, about, besides, against),
 conjunctions/connectors (and, but, for, nor, or, so, yet),
 verbs (is, are, was, were),
 adverbs (here, there, out, because, soon, after),
 adjectives (all, any, each, every, few, many, some) can also be
treated as stopwords.
 Stopwords are language dependent.

Stopwords
 Intuition:
 Stopwords have little semantic content; It is typical to remove
such high-frequency words
 Stopwords take up a large share (around 50%) of the text. Hence, removing them reduces document size by 30-50%
 Smaller indices for information retrieval
 Good compression techniques for indices:
 The 30 most common words account for 30% of the tokens
in written text
 Better approximation of importance for classification,
summarization, etc.
How to determine a list of stopwords?
 One method: Sort terms (in decreasing order) by collection
frequency and take the most frequent ones
 In a collection about insurance practices, “insurance” would be a
stop word
Another method: Build a stop word list that contains a set
of articles, pronouns, etc.
 Why do we need stop lists? With a stop list, we can exclude the commonest words from the index terms entirely.
 With the removal of stopwords, we get a better approximation of term importance for classification, summarization, etc.
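A minimal sketch of the first method (not from the slides): sort terms by collection frequency and take the k most frequent ones as the stop list. The cutoff k is an arbitrary assumption; in practice the resulting list is inspected by hand.

from collections import Counter

def build_stoplist(documents, k=30):
    """Return the k terms with the highest collection frequency as a stop list."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(k)}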

Stop words
 Stop word elimination used to be standard in older IR systems.
 But the trend is getting away from doing this. Most web search
engines index stop words:
 Good query optimization techniques mean you pay little at
query time for including stop words.
 You need stopwords for:
 Phrase queries: “King of Denmark”
 Various song titles, etc.: “Let it be”, “To be or not to be”
 “Relational” queries: “flights to London”
 Elimination of stopwords might reduce recall (e.g. for "To be or not to be", all terms except "be" are eliminated, which will produce no or irrelevant retrieval)
Normalization
 Normalization is canonicalizing (reducing to a standard form) tokens so that matches occur despite superficial differences in the character sequences of the tokens
 Need to “normalize” terms in indexed text as well as query terms
into the same form
 Example: We want to match U.S.A. and USA, by deleting
periods in a term
 Case Folding: Often best to lower case everything, since users
will use lowercase regardless of ‘correct’ capitalization…
 Republican vs. republican
 Anti-discriminatory vs. antidiscriminatory
 Car vs. automobile?
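A minimal sketch of the two normalizations just mentioned (period deletion and case folding); the rules are deliberately naive and would need refinement for real text.

def normalize(term):
    """Delete periods (U.S.A. -> USA) and case-fold (Republican -> republican)."""
    return term.replace(".", "").lower()

assert normalize("U.S.A.") == normalize("USA") == "usa"
assert normalize("Anti-discriminatory") == "anti-discriminatory"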
Normalization issues
 Case folding is good for
 Allowing instances of Automobile at the beginning of a
sentence to match with a query of automobile
 Helping a search engine when most users type ferrari when
they are interested in a Ferrari car
 Bad for
 Proper names vs. common nouns
 E.g. General Motors, Associated Press, …
 Solution:
 lowercase only words at the beginning of the sentence
 In IR, lowercasing is most practical because of the way users
issue their queries
Stemming/Morphological analysis
 Stemming reduces tokens to their "root" form in order to recognize morphological variants.
 The process involves removal of affixes (i.e. prefixes and
suffixes) with the aim of reducing variants to the same stem
 Often removes inflectional and derivational morphology of a

word
 Inflectional morphology: vary the form of words in order to
express grammatical features, such as singular/plural or
past/present tense. E.g. Boy → boys, cut → cutting.
 Derivational morphology: makes new words from old ones.
E.g. creation is formed from create , but they are two separate
words. And also, destruction → destroy

 Stemming is language dependent


 Correct stemming is language specific and can be complex.

Stemming
 The final output from a conflation(reducing to the same token)
algorithm is a set of classes, one for each stem detected.
 A Stem: the portion of a word which is left after the removal
of its affixes (i.e., prefixes and/or suffixes).
 Example: ‘connect’ is the stem for {connected, connecting, connection, connections}
 [automate, automatic, automation] all reduce to → automat
 A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords.
 Queries : Queries are handled in the same way.
Ways to implement stemming
There are basically two ways to implement stemming.
1. 1st approach is to create a big dictionary that maps words to
their stems.
 The advantage of this approach is that it works perfectly (insofar as the

stem of a word can be defined perfectly); the disadvantages are the space
required by the dictionary and the investment required to maintain the
dictionary as new words appear.
2. 2nd approach is to use a set of rules that extract stems from
words.
 The advantages of this approach are that the code is typically small, and it

can gracefully handle new words; the disadvantage is that it occasionally


makes mistakes. But, since stemming is imperfectly defined, anyway,
occasional mistakes are tolerable, and the rule-based approach is the one
that is generally chosen.
Porter Stemmer
Stemming is the operation of stripping the suffixes from a word, leaving its stem.
 Google, for instance, uses stemming to search for web pages
containing the words connected, connecting, connection
and connections when users ask for a web page that contains
the word connect.
 In 1979, Martin Porter developed a stemming algorithm that

uses a set of rules to extract stems from words, and though it


makes some mistakes, most common words seem to work out
right.

Porter stemmer
 Most common algorithm for stemming English words to their
common grammatical root
 It is a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
 SSES → SS     caresses → caress
 IES → I       ponies → poni
 SS → SS       caress → caress
 S →           cats → cat
 EMENT →       (delete the final -ement if what remains is longer than 1 character)
   replacement → replac
   cement → cement
Porter stemmer
 While step 1a gets rid of plurals, step 1b removes -ed
or -ing.
e.g.
 agreed -> agree        disabled -> disable
 matting -> mat         mating -> mate
 meeting -> meet        milling -> mill
 messing -> mess        meetings -> meet
 feed -> feed
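Below is a minimal sketch of step 1a exactly as listed, with a deliberately simplified stand-in for step 1b: it omits Porter's "measure" condition and the cleanup rules, so it will not reproduce every example above (e.g. mating -> mate). A complete implementation is available, for instance, as nltk.stem.PorterStemmer (assuming NLTK is installed).

def porter_step1a(word):
    """Plural rules exactly as listed: SSES->SS, IES->I, SS->SS, S->''."""
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]          # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

def step1b_simplified(word):
    """Very rough -ed/-ing handling; the real algorithm uses Porter's measure
    and extra cleanup rules (e.g. mating -> mate), which are omitted here."""
    if word.endswith("eed"):
        stem = word[:-3]          # agreed -> agree, feed stays feed
        return stem + "ee" if any(v in stem for v in "aeiou") else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if any(v in stem for v in "aeiou"):
                return stem
    return word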

Stemming: challenges
 May produce unusual stems that are not English words:
 Removing ‘UAL’ from FACTUAL and EQUAL

 May conflate (reduce to the same token) words that are actually
distinct.
 “computer”, “computational”, “computation” all reduced to

same token “comput”

 May not recognize all morphological derivations.

Thesauri
 Full-text searching alone is mostly not accurate, since different authors may select different words to represent the same concept
 Problem: The same meaning can be expressed using
different terms that are synonyms, homonyms, and related
terms
 How can it be ensured that, for the same meaning, identical terms are used in the index and in the query?

 Thesaurus: The vocabulary of a controlled indexing


language, formally organized so that a priori relationships
between concepts (for example as "broader" and “related")
are made explicit.
Thesauri …
• A thesaurus contains terms and relationships between
terms
– IR thesauri rely typically upon the use of symbols such as
USE/UF (UF=used for), BT, and RT to demonstrate
inter-term relationships.
– e.g., car = automobile, truck, bus, taxi, motor vehicle

-color = colour, paint

Aim of Thesaurus
 Thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms
 The aim of thesaurus is therefore:
 to provide a standard vocabulary for indexing and searching
 A thesaurus rewrites terms to form equivalence classes, and we index such equivalences
 When the document contains automobile, index it under car as well (usually also vice versa)
 to assist users with locating terms for proper query formulation: when the query contains automobile, look under car as well to expand the query
 to provide classified hierarchies that allow broadening and narrowing of the current request according to user needs
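A minimal sketch of thesaurus-based expansion as described above: terms are mapped to an equivalence class, and both index terms and query terms are expanded with the class members. The tiny synonym table is an illustrative stand-in for a real thesaurus.

# illustrative equivalence classes (a real thesaurus would be far larger)
THESAURUS = {
    "automobile": {"car", "automobile", "motor vehicle"},
    "car": {"car", "automobile", "motor vehicle"},
    "colour": {"color", "colour"},
    "color": {"color", "colour"},
}

def expand(terms):
    """Expand each term with its thesaurus class (used for both indexing and queries)."""
    expanded = set()
    for t in terms:
        expanded |= THESAURUS.get(t, {t})
    return expanded

print(expand(["automobile", "engine"]))
# -> {'automobile', 'car', 'motor vehicle', 'engine'} (set order may vary)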
Thesaurus Construction
Example: thesaurus built to assist IR for searching cars and
vehicles :
Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport

More Example
Example: thesaurus built to assist IR in the fields of computer
science:
TERM: natural languages
 UF natural language processing (UF=used for NLP)

 BT languages (BT=broader term is languages)

 TT languages (TT = (top term) is languages)

 RT artificial intelligence (RT=related term/s)

 computational linguistics
 formal languages
 query languages
 speech recognition
Language-specificity

 Many of the above features embody transformations that are
 Language-specific and

 Often, application-specific

 These are “plug-in” addenda to the indexing process


 Both open source and commercial plug-ins are available for
handling these issues

INDEX TERM SELECTION
 Index language is the language used to describe documents
and requests
 Elements of the index language are index terms which may be
derived from the text of the document to be described, or may
be arrived at independently.
 If a full text representation of the text is adopted, then all words in
the text are used as index terms = full text indexing
 Otherwise, we need to select the words to be used as index terms, reducing the size of the index file, which is basic to designing an efficient IR search system
