Professional Documents
Culture Documents
Information Retrieval
Document ingestion
Introduction to
Information Retrieval
Tokens
Recall the basic indexing pipeline
Documents to Friends, Romans, countrymen.
be indexed
Tokenizer
Linguistic modules
Indexer 2 4
friend
1 2
Inverted index roman
13 16
countryman
Sec. 2.1
Parsing a document
• What format is it in?
– pdf/word/excel/html?
• What language is it in?
• What character set is in use?
– UTF-8 (encoding schemes)
Complications: Format/language
• Sometimes a document or its components can
contain multiple languages/formats
– French email with a German pdf attachment.
– French email quote clauses from an English-language
contract
• Documents being indexed can include docs from
many different languages
– A single index may contain terms from many
languages.
Tokenization
▪ Input: “Friends, Romans and Countrymen”
▪ Output: Tokens
▪ Friends
▪ Romans
▪ Countrymen
• tokenization is the task of chopping a sequence of characters up into
pieces, called tokens, perhaps at the same time throwing away certain
characters, such as punctuation
▪ A token is an instance of a sequence of characters
▪ Each such token is now a candidate for an index entry, after further
processing
▪ A type is the class of all tokens containing the same character sequence.
▪ A term is a (perhaps normalized) type that is included in the IR system's
dictionary.
▪ Example: to sleep perchance to dream,
-there are 5 tokens, but only 4 types
Sec. 2.2.1
Tokenization
• Issues in tokenization:
– Finland’s capital →
Finland AND s? Finlands? Finland’s?
– Hewlett-Packard → Hewlett and Packard as two
tokens?
• state-of-the-art: break up hyphenated sequence.
• co-education
• lowercase, lower-case, lower case ?
• It can be effective to get the user to put in possible hyphens
Numbers
• 3/20/91 Mar. 12, 1991
20/3/91
• 55 B.C.
• B-52
• My PGP key is 324a3df234cb23e
• (800) 234-2333
– Often have embedded spaces
– Older IR systems may not index numbers
• But often very useful: think about things like looking up
date of an Email
– Will often index separately as “doc meta-data”
Sec. 2.2.1
❑ Determined by collection frequency – the total number of times each term appears in
the document collection
❑ Using a stop list significantly reduces the number of postings that a system has to store
Stop words