

5 - Features 2
Normalization
Token normalization

"The process of canonicalizing tokens so that matches occur despite superficial differences"

- Manning, Raghavan and Schütze (p. 28)


Equivalence Classes

● U.S.A, USA
● Automobile, Car
● cliché, cliche
● Anti-discriminatory, antidiscriminatory
● color, colour
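
A minimal sketch of how some of these equivalence classes could be formed with surface rules alone. The function name `normalize_token` and the specific rules (strip accents, periods and hyphens, then fold case) are illustrative assumptions, not a method given in the slides.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """Map surface variants to one canonical key (illustrative rules only)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove periods and hyphens, then fold case.
    return no_accents.replace(".", "").replace("-", "").casefold()


pairs = [
    ("U.S.A", "USA"),
    ("cliché", "cliche"),
    ("Anti-discriminatory", "antidiscriminatory"),
    ("color", "colour"),
    ("Automobile", "Car"),
]
for a, b in pairs:
    print(a, b, normalize_token(a) == normalize_token(b))
```

The last two pairs stay distinct: spelling variants and synonyms cannot be merged by character-level rules and would need a hand-built or learned mapping.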
Case-folding

● Case-folding converts all characters to lowercase
● Most common type of normalization: "Coffee" and "coffee" are equivalent
● Concern: capitalization is sometimes the only thing that distinguishes a proper noun from a common word: Bush vs. bush, Fed vs. fed, General Motors vs. general motors
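
A small sketch of case-folding in Python. It uses `str.casefold()`, which behaves like `lower()` for plain English text; the helper name `case_fold` is an assumption for illustration.

```python
def case_fold(token: str) -> str:
    """Fold case so that tokens differing only in capitalization match."""
    return token.casefold()


# The equivalence the slide describes.
print(case_fold("Coffee") == case_fold("coffee"))

# The concern: after folding, proper nouns collide with common words.
print(case_fold("Bush") == case_fold("bush"))
print(case_fold("General Motors"))
```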
True casing

He saw a buffalo in Buffalo.


Will retain difference between buffalo (animal) and Buffalo (city)
True casing

Buffalo in Buffalo, what a sight!


Still good with true-casing.
True casing

Buffalo, NY doesn't actually have any buffalo


Hmmm...
True casing

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo


Let's not go there!
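
The slides do not name a true-casing algorithm. As a rough sketch, assume the simplest heuristic: lowercase only the sentence-initial word and leave mid-sentence capitalization alone. The function name `truecase_heuristic` and the tokenizing regex are assumptions.

```python
import re
from typing import List


def truecase_heuristic(sentence: str) -> List[str]:
    """Lowercase only the first token; keep all other capitalization."""
    # Words, optionally with an internal apostrophe (e.g. "doesn't").
    tokens = re.findall(r"\w+(?:'\w+)?", sentence)
    if tokens:
        tokens[0] = tokens[0].lower()
    return tokens


for s in [
    "He saw a buffalo in Buffalo.",
    "Buffalo in Buffalo, what a sight!",
    "Buffalo, NY doesn't actually have any buffalo",
]:
    print(truecase_heuristic(s))
```

The first two sentences come out right under this heuristic. In the third, the sentence-initial "Buffalo" is the city, so lowercasing it is a mistake, which is the "Hmmm..." case above.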
Alternate spellings

● Favorite, favourite
● Modelling, modeling
● Gim'me, gimme
● Hallowe'en, halloween
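
One possible way to handle these pairs, sketched under assumptions: apostrophes are dropped by rule, and British/American variants go through a small lookup table. The table `SPELLING_VARIANTS` and the function `normalize_spelling` are hypothetical and cover only the examples above.

```python
# Hypothetical variant table: maps one spelling to the chosen canonical one.
SPELLING_VARIANTS = {
    "favourite": "favorite",
    "modelling": "modeling",
}


def normalize_spelling(token: str) -> str:
    """Lowercase, drop apostrophes, then apply the variant table."""
    key = token.lower().replace("'", "").replace("\u2019", "")
    return SPELLING_VARIANTS.get(key, key)


for word in ["Favorite", "favourite", "Modelling", "modeling",
             "Gim'me", "gimme", "Hallowe'en", "halloween"]:
    print(word, "->", normalize_spelling(word))
```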
Internationalized names and transliterations

● Warsaw and Warszawa


● Wroclaw and Breslau
● Beijing and Peking
● Dostoevsky, Dostoyevsky, Dostoyevski, Dostoevskii
VIAF

VIAF.org

"Authority files" normalizing naming and variants of people


https://viaf.org/viaf/104023256/#Dostoyevsky,_Fyodor,_1821-1881

685 Alternate Name Forms!!
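
To illustrate the idea of an authority file, here is a toy local lookup that maps several name forms to one identifier. It does not call any VIAF service; the table `AUTHORITY_IDS`, the `viaf:` prefix and the function `authority_id` are assumptions, and only the four transliterations listed earlier are included (the identifier is the one in the URL above).

```python
from typing import Optional

# Toy authority table: every known variant points to the same record ID.
AUTHORITY_IDS = {
    "dostoevsky": "viaf:104023256",
    "dostoyevsky": "viaf:104023256",
    "dostoyevski": "viaf:104023256",
    "dostoevskii": "viaf:104023256",
}


def authority_id(name: str) -> Optional[str]:
    """Return the shared record ID for a name variant, or None if unknown."""
    return AUTHORITY_IDS.get(name.casefold())


print(authority_id("Dostoevsky") == authority_id("Dostoyevski"))
print(authority_id("Tolstoy"))
```

Matching on the identifier rather than the string is what lets all variants of a name retrieve the same person.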


Numbers and Dates

● 100,000, 100 000 and 100000


● 2/4/2017 and Feb 4th 2017
○ Problem: US vs. international conventions: 2/4 means February 4th in the US but April 2nd in most other countries!
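
A short sketch of both points, using only the standard library. The helper names are assumptions; the date formats are the standard `strptime` directives for month-first and day-first order.

```python
import re
from datetime import datetime


def normalize_number(text: str) -> str:
    """Drop comma and whitespace group separators from a number string."""
    return re.sub(r"[,\s]", "", text)


def parse_us(text: str) -> str:
    """Read a date as month/day/year and return it in ISO format."""
    return datetime.strptime(text, "%m/%d/%Y").date().isoformat()


def parse_day_first(text: str) -> str:
    """Read a date as day/month/year and return it in ISO format."""
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()


print({normalize_number(n) for n in ["100,000", "100 000", "100000"]})
print(parse_us("2/4/2017"))         # February 4th
print(parse_day_first("2/4/2017"))  # April 2nd
```

The same string yields two different dates, so the convention has to be known (or guessed from the document's locale) before normalizing.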
Japanese

● Text mixes several scripts: hiragana, katakana, Chinese characters (kanji), and Western (Latin) characters
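
One narrow piece of this can be sketched in code: folding hiragana into katakana so that the same word written in either syllabary matches. This is an illustrative assumption, not a technique named in the slides; it relies on the two Unicode blocks being laid out in parallel, 0x60 code points apart, and does nothing for kanji.

```python
# Hiragana U+3041..U+3096 maps onto katakana U+30A1..U+30F6 (offset 0x60).
HIRAGANA_TO_KATAKANA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def fold_kana(text: str) -> str:
    """Replace hiragana with the corresponding katakana; leave the rest."""
    return text.translate(HIRAGANA_TO_KATAKANA)


print(fold_kana("ひらがな"))
print(fold_kana("カタカナ abc"))
```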


Stemming / Lemmatization

Stemming: rules to chop off the end of the word and hope for the best

Works for regular word forms, which covers the majority of cases

E.g. Colors, color, colours
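
A deliberately crude suffix-stripping sketch to make "chop off the end and hope for the best" concrete. This is not the Porter stemmer or any published algorithm; the suffix list, the minimum stem length and the name `crude_stem` are assumptions.

```python
SUFFIXES = ("ing", "ed", "s")  # illustrative only, checked in this order
MIN_STEM_LENGTH = 3


def crude_stem(word: str) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for w in ["Colors", "color", "colours"]:
    print(w, "->", crude_stem(w))
```

With these rules "Colors" and "color" share a stem, while "colours" becomes "colour": plural stripping alone does not merge spelling variants, so that pair still needs the spelling normalization discussed earlier.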


Stemming / Lemmatization

Lemmatization: uses a dictionary and/or analysis of words to return each word to its dictionary form (lemma)

E.g.

am, are, is => be

Car, cars, car's, cars' => car
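
A toy dictionary-based lemmatizer covering just the examples above. Real lemmatizers use much larger lexicons plus morphological analysis; the table `LEMMAS` and the function `lemmatize` are hypothetical.

```python
# Hypothetical lemma dictionary: inflected form -> dictionary form.
LEMMAS = {
    "am": "be",
    "are": "be",
    "is": "be",
    "cars": "car",
    "car's": "car",
    "cars'": "car",
}


def lemmatize(word: str) -> str:
    """Look the word up; fall back to the lowercased word itself."""
    key = word.lower()
    return LEMMAS.get(key, key)


for w in ["am", "are", "is", "Car", "cars", "car's", "cars'"]:
    print(w, "=>", lemmatize(w))
```

Unlike suffix chopping, the lookup handles irregular forms such as "am" and "is", at the cost of needing an entry (or an analysis rule) for every form.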
