• Derivation:
– Ex: compute → computer → computerization
– Less systematic than inflection
– It can involve a change of meaning
• Compounding:
– Merging of two or more words into a new word
• Downmarket, (to) overtake
Stemming & Lemmatization
• The removal of inflectional endings from words (stripping off affixes)
• laughing, laugh, laughs, laughed → laugh
– Problems
• Can conflate semantically different words
– Gallery and gall may both be stemmed to gall
– Regular Expressions for Stemming
– Porter Stemmer
– nltk.wordnet.morphy
• A further step is to make sure that the resulting form is a known
word in a dictionary, a task known as lemmatization.
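As a small illustration, here is a sketch of both steps with NLTK (assuming nltk is installed and the WordNet data has been downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for w in ["laughing", "laughs", "laughed", "gallery"]:
        # stem() just strips affixes; lemmatize() maps to a dictionary form
        print(w, stemmer.stem(w), lemmatizer.lemmatize(w, pos="v"))

    # The stemmer conflates forms but can produce non-words ("gallery" -> "galleri");
    # the lemmatizer only returns known dictionary entries.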
Grammar: words: POS
• Words of a language are grouped into classes to reflect similar
syntactic behaviors
• Syntactic or grammatical categories (aka parts of speech)
– Nouns (people, animal, concepts)
– Verbs (actions, states)
– Adjectives
– Prepositions
– Determiners
• Open or lexical categories (nouns, verbs, adjectives)
– Large number of members, new words are commonly added
• Closed or functional categories (prepositions, determiners)
– Few members, clear grammatical use
Part-of-speech (English)
Terminology
• Tagging
– The process of associating labels with each token in a text
• Tags
– The labels
– Syntactic word classes
• Tag Set
– The collection of tags used
Example
• Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ
strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS
targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
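A comparable tagging can be produced with NLTK's default tagger; this is a sketch (it assumes the averaged_perceptron_tagger data has been downloaded, and the exact tags may differ slightly from the slide):

    import nltk

    sent = "These findings should be useful for therapeutic strategies ."
    print(nltk.pos_tag(sent.split()))
    # e.g. [('These', 'DT'), ('findings', 'NNS'), ('should', 'MD'), ('be', 'VB'), ...]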
Part-of-Speech Ambiguity
– A tagger that always chooses the most common tag for each word is about 90% correct (often used as a baseline)
• Tagging introduces distinctions: ambiguities may be resolved in context
– e.g. deal tagged as NN or VB
P(t_n | w_n)
N-Gram Tagging
• An n-gram tagger is a generalization of a unigram tagger whose context is the
current word together with the part-of-speech tags of the n-1 preceding tokens
• A 1-gram tagger is another term for a unigram tagger: i.e., the context used to
tag a token is just the text of the token itself. 2-gram taggers are also called
bigram taggers, and 3-gram taggers are called trigram taggers.
trigram tagger:
P(t_n | w_n, t_{n-1}, t_{n-2})
N-Gram Tagging
• Why not 10-gram taggers?
• As n gets larger, the specificity of the contexts increases, as does the chance
that the data we wish to tag contains contexts that were not present in the
training data.
• This is known as the sparse data problem, and it is quite pervasive in NLP. As a
consequence, there is a trade-off between the accuracy and the coverage of our
results (this is related to the precision/recall trade-off).
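A minimal sketch of an n-gram tagger with backoff in NLTK, which also illustrates the accuracy/coverage trade-off (assumes the treebank corpus has been downloaded; the 3000-sentence split is an arbitrary choice):

    import nltk
    from nltk.corpus import treebank

    train = treebank.tagged_sents()[:3000]
    test = treebank.tagged_sents()[3000:]

    t0 = nltk.DefaultTagger("NN")               # baseline: always the most common tag
    t1 = nltk.UnigramTagger(train, backoff=t0)  # context: the word itself
    t2 = nltk.BigramTagger(train, backoff=t1)   # context: word + previous tag
    print(t2.accuracy(test))                    # older NLTK versions: t2.evaluate(test)

Without the backoff chain, the bigram tagger alone would fail on contexts unseen in training; backing off to simpler models is the standard NLTK way around the sparse data problem.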
Markov Model Tagger
• Bigram tagger
• Assumptions:
– Words depend only on their tags (words are independent of each other given the tags)
– Each tag depends only on the previous tag
(Diagram: tag sequence t_1 → t_2 → … → t_n, each tag t_i emitting word w_i)
P(t, w) = P(t_1, t_2, ..., t_n, w_1, w_2, ..., w_n) = ∏_i P(t_i | t_{i-1}) P(w_i | t_i)
Rule-Based Tagger
• The Linguistic Complaint
– Where is the linguistic knowledge of a tagger?
– We could thus use handcrafted sets of rules to tag input sentences; for example,
if a word follows a determiner, tag it as a noun.
The Brill tagger
(transformation-based tagger)
• An example of Transformation-Based Learning
– Basic idea: do a quick job first (using frequency), then revise it using
contextual rules.
• Very popular (freely available, works fairly well)
– Probably the most widely used tagger (esp. outside NLP)
– … but not the most accurate: 96.6% / 82.0%
• A supervised method: requires a tagged corpus
Text Processing
• Word segmentation
– For languages that do not put spaces between words
• Chinese, Japanese, Korean, Thai, German (for compound nouns)
• Tokenization
• Sentence segmentation
– Divide text into sentences
Tokenization
• Divide text into units called tokens (words, numbers, punctuations)
• What is a word?
– Graphic word: a string of contiguous alphanumeric characters surrounded by
whitespace
• $22.50
– Main clue (in English) is the occurrence of whitespaces
– Problems
• Periods: punctuation is usually removed, but sometimes it is useful to keep periods
(Wash. vs. wash)
• Single apostrophes, contractions (isn’t, didn’t, dog’s): for meaning extraction it can
be useful to keep two separate forms (is + n’t or not)
• Hyphenation:
– Sometimes best as a single word: co-operate
– Sometimes best as two separate words: 26-year-old, aluminum-export ban
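For a quick look at how one widely used tokenizer resolves these cases, here is a sketch with NLTK (assumes the punkt data has been downloaded; the shown output is indicative):

    from nltk.tokenize import word_tokenize

    print(word_tokenize("The dog isn't in Wash."))
    # e.g. ['The', 'dog', 'is', "n't", 'in', 'Wash', '.']
    # Note the contraction split into two forms (is + n't), as discussed above.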
Lexical Relations
• Hyponymy
– scarlet, vermilion, carmine, and crimson are all hyponyms of red
• Antonymy (opposite)
– Male, female
• Etc..
Word Senses
• Words have multiple distinct meanings, or senses:
– Plant: living plant, manufacturing plant, …
– Title: name of a work, ownership document, form of address, material at the
start of a film, …
• Many levels of sense distinctions
– Homonymy: totally unrelated meanings (river bank, money bank)
– Polysemy: related meanings (star in sky, star on tv, title)
– Systematic polysemy: productive meaning extensions (metonymy such as
organizations to their buildings) or metaphor
– Sense distinctions can be extremely subtle (or not)
• Granularity of senses needed depends a lot on the task
Word Sense Disambiguation
• Supervised learning
– When we know the truth (the true senses); such labeled data is not always available or easy to obtain
– Classification task
– Many competing classification technologies perform about the same (it’s all
about the knowledge sources you tap)
• Collocations
– White skin, white wine, white hair
– Semantic similarity
– (Logical metonymy)
– Selectional preferences
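As a runnable illustration (not one of the supervised methods above), NLTK ships the simple dictionary-based Lesk baseline, which picks the WordNet sense whose definition best overlaps the context (assumes the WordNet data has been downloaded):

    from nltk.wsd import lesk

    context = "I went to the bank to deposit my money".split()
    print(lesk(context, "bank"))  # returns the WordNet Synset chosen from context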
Collocations
• A collocation is an expression consisting of two or more words that
correspond to some conventional way of saying things
– Noun phrases: weapons of mass destruction, stiff breeze (but why not *stiff
wind?)
– Verbal phrases: to make up
– Not necessarily contiguous: knock…. door
• Limited compositionality
– Compositional if meaning of expression can be predicted by the meaning of the
parts
– Idioms are most extreme examples of non-compositionality
• Kick the bucket
Collocations
• Non Substitutability
– Cannot substitute words in a collocation
• *yellow wine
• Non modifiability
– To get a frog in one’s throat
• *To get an ugly frog in one’s throat
• Useful for
– Language generation
• *Powerful tea, *take a decision
– Machine translation
• An easy way to test whether a combination is a collocation is to translate it into
another language
– make a decision → *faire une décision (prendre), *fare una decisione (prendere)
Finding collocations
• Frequency
– If two words occur together a lot, that may be evidence that they have a
special function
– Filter by POS patterns
– e.g. Adjective + Noun (linear function), Noun + Noun (regression coefficients), etc.
• Mean and variance of the distance between the words
– For non-contiguous collocations
• Mutual information measures
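A minimal collocation-finding sketch with NLTK, combining a frequency filter with pointwise mutual information (assumes the brown corpus has been downloaded; the threshold 5 is an arbitrary choice):

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
    from nltk.corpus import brown

    measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(brown.words())
    finder.apply_freq_filter(5)   # mutual information overrates very rare pairs
    print(finder.nbest(measures.pmi, 10))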
Lexical acquisition
• Examples:
– “insulin” and “progesterone” are in WordNet 2.1 but “leptin” and
“pregnenolone” are not.
• Measures of similarity
– WordNet-based
– Vector-based
• Why Important?
– To infer meaning from selectional restrictions
• Suppose we don’t know the word durian (it is not in the vocabulary)
• Statistical models
The NLP Pipeline
• For a given problem to be tackled
1. Choose corpus (or build your own)
– Low level processing done to the text before the ‘real work’ begins
• Important but often neglected
– Low-level formatting issues
• Junk formatting/content (HTML tags, tables)
• Case change (e.g. everything to lowercase)
• Tokenization, sentence segmentation
2. Choose the annotation to use (or choose the label set and label the data
yourself)
– Check the labeling (inconsistencies etc.)
3. Extract features
4. Choose or implement new NLP algorithms
5. Evaluate
6. (eventually) Re-iterate
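A sketch of the low-level processing in step 1 above, on a hypothetical input string (assumes NLTK with the punkt data; the regex-based HTML stripping is a deliberate simplification):

    import re
    import nltk

    raw = "<p>These findings should be useful. They target the CD28 pathway.</p>"
    text = re.sub(r"<[^>]+>", " ", raw)              # junk formatting: strip HTML tags
    text = text.lower()                              # case change
    sents = nltk.sent_tokenize(text)                 # sentence segmentation
    tokens = [nltk.word_tokenize(s) for s in sents]  # tokenization
    print(tokens)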
Corpora
• Text Corpora & Annotated Text Corpora
– NLTK corpora
– Use/create your own
• Lexical resources
– WordNet
– VerbNet
– FrameNet
– Domain specific lexical resources
• Corpus Creation
• Annotation
Annotated Text Corpora
• Annotation is not part of the text in the file; it describes something of the structure
and/or semantics of the text
• Grammar annotation
– POS, parses, chunks
• Semantic annotation
– Topics, named entities, sentiment, author, language, word senses, co-reference …
• Lower level annotation
– Word tokenization, Sentence Segmentation, Paragraph Segmentation
Processing Search Engine Results
• The web can be thought of as a huge corpus of unannotated text.
Probability Basics
• An event A is a subset of the sample space Ω
• A probability function maps events to [0, 1]:
P : Ω → [0, 1]
Prior Probability
• P(A): the prior (unconditional) probability of event A
Conditional probability
P(A | B) = P(A, B) / P(B)
Conditional probability (cont)
P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Note: P(A,B) = P(A ∩ B)
• Chain Rule
• P(A, B) = P(A|B) P(B) = The probability that A and B both happen is the
probability that B happens times the probability that A happens, given B has
occurred.
• P(A, B) = P(B|A) P(A) = The probability that A and B both happen is the
probability that A happens times the probability that B happens, given A has
occurred.
• The joint distribution is a multi-dimensional table with a value in every cell giving
the probability of that specific state occurring
Chain Rule
P(A, B) = P(A|B) P(B)
        = P(B|A) P(A)
P(A, B, C, D, …) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) …
Chain Rule → Bayes' Rule
P(A, B) = P(A|B) P(B) = P(B|A) P(A)
⇒ P(A|B) = P(B|A) P(A) / P(B)    (Bayes' rule)
Bayes' rule
P(A|B) = P(B|A) P(A) / P(B)
For example, if A is the event that a patient has a disease, and B is the event
that she displays a symptom, then P(B | A) describes a causal relationship, and
P(A | B) describes a diagnostic one (that is usually hard to assess).
If P(B | A), P(A) and P(B) can be assessed easily, then we get P(A | B) for free.
Example
• S: stiff neck, M: meningitis
• P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
• I have a stiff neck: should I worry?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
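The same arithmetic as a few lines of Python, just to make the computation explicit:

    p_s_given_m = 0.5        # P(S | M)
    p_m = 1 / 50_000         # P(M)
    p_s = 1 / 20             # P(S)
    print(p_s_given_m * p_m / p_s)   # P(M | S) = 0.0002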
(Conditional) independence
• Two events A and B are independent of each other if
P(A) = P(A | B)
s′ = argmax_{s_k} P(s_k | c)    (the most probable sense s_k given context c)
Graphical Models
• Within the Machine Learning framework
• Probability theory plus graph theory
• Widely used
– NLP
– Speech recognition
– Systems diagnosis
– Computer vision
– Bioinformatics
Graphical Models
(Diagram: class variable Y with children x_1, x_2, x_3, …)
• Each x_i depends on Y
• Naïve Bayes assumption: all x_i are conditionally independent given Y
• Used for text classification and spam detection
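A toy Naïve Bayes text classifier with NLTK, using hand-made examples purely for illustration (the feature extractor and training data below are invented for this sketch):

    import nltk

    def feats(text):
        # bag-of-words features: each x_i is the presence of a word
        return {w: True for w in text.lower().split()}

    train = [(feats("win money now"), "spam"),
             (feats("cheap money offer"), "spam"),
             (feats("meeting at noon"), "ham"),
             (feats("lunch at noon tomorrow"), "ham")]

    clf = nltk.NaiveBayesClassifier.train(train)
    print(clf.classify(feats("free money offer")))  # likely 'spam'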
Naïve Bayes models
(Diagrams: a sense node s_k generating its context words w_1 … w_n / v_1, v_2, v_3)
Language Models
• Goal: assign probabilities to word sequences:
P(w_1, w_2, ..., w_N)
• Probabilities should broadly indicate likelihood of sentences
– P( I saw a van) >> P( eyes awe of an)
• Not grammaticality
– P(artichokes intimidate zippers) ≈ 0
• In principle, “likely” depends on the domain, context, speaker…
Language models
• Related: the task of predicting the next word
P(w_n | w_1, ..., w_{n-1})
• Can be useful for
– Spelling correction
• I need to notified the bank
– Machine translation
– Speech recognition
– OCR (optical character recognition)
– Handwriting recognition
– Augmentative communication
• Computer systems to help the disabled in communication
– For example, systems that let users choose words with hand movements
Language Models
• Model to assign scores to sentences
P(w_1, w_2, ..., w_N)
Markov assumption: n-gram solution
P(w_i | w_1, w_2, ..., w_{i-1})
• Markov assumption: only the prior local context (the last “few” n words) affects
the next word
• N-gram models: assume each word depends only on a short
linear history
– Use N-1 words to predict the next one
P(w_i | w_{i-n}, ..., w_{i-1})
P(w_1, w_2, ..., w_N) ≈ ∏_i P(w_i | w_{i-n} ... w_{i-1})
Markov assumption: n-gram solution
• Unigrams (n = 1)
P(w_i | w_{i-n}, ..., w_{i-1}) ≈ P(w_i)
P(w_1, w_2, ..., w_N) ≈ ∏_i P(w_i)
• Bigrams (n = 2)
P(w_i | w_{i-n}, ..., w_{i-1}) ≈ P(w_i | w_{i-1})
P(w_1, w_2, ..., w_N) ≈ ∏_i P(w_i | w_{i-1})
• Trigrams (n = 3)
P(w_i | w_{i-n}, ..., w_{i-1}) ≈ P(w_i | w_{i-2}, w_{i-1})
P(w_1, w_2, ..., w_N) ≈ ∏_i P(w_i | w_{i-2}, w_{i-1})
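A minimal MLE bigram model in plain Python over a toy corpus (the corpus string is invented for this sketch):

    from collections import Counter

    tokens = "I saw a van . I saw a cat .".split()   # toy corpus
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def p(w, prev):
        # P(w | prev) = c(prev, w) / c(prev); zero for unseen bigrams (sparsity)
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

    print(p("saw", "I"))   # 1.0 in this toy corpus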
Choice of n
• In principle we would like the n of the n-gram to be large
– green
– large green
– the large green
– swallowed the large green
– swallowed should influence the choice of the next word
(mountain is unlikely, pea more likely)
– The crocodile swallowed the large green ..
– Mary swallowed the large green ..
– And so on…
Discrimination vs. reliability
• But it’s much harder to get reliable statistics since the number of
parameters to estimate becomes too large
– The larger n, the larger the number of parameters to estimate, the larger the
data needed to do statistically reliable estimations
Language Models
• N = size of the vocabulary
• Unigrams
P(w_i | w_{i-n}, ..., w_{i-1}) ≈ P(w_i)
For each w_i calculate P(w_i): N such numbers → N parameters
• Bigrams
P(w_1, w_2, ..., w_N) ≈ ∏_i P(w_i | w_{i-1})
For each w_i, w_j calculate P(w_i | w_j): N × N parameters
• Trigrams
P(w_1, w_2, ..., w_N) ≈ ∏_i P(w_i | w_{i-1}, w_{i-2})
For each w_i, w_j, w_k calculate P(w_i | w_j, w_k): N × N × N parameters
N-grams and parameters
• Assume we have a vocabulary of 20,000 words
• Growth in the number of parameters for n-gram models:
Model            Parameters
Bigram model     20,000^2 = 400 million
Trigram model    20,000^3 = 8 trillion
Four-gram model  20,000^4 = 1.6 × 10^17
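The counts in the table follow directly from the vocabulary size; a two-line check:

    V = 20_000
    for n in (2, 3, 4):
        print(f"{n}-gram model: {V ** n:.2g} parameters")  # 4e+08, 8e+12, 1.6e+17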
Sparsity
• Zipf’s law: most words are rare
• New words appear all the time; new bigrams appear even more often, and unseen
trigrams (and longer n-grams) are more common still!
P(w_i) = c(w_i) / N              (= 0 if c(w_i) = 0)
P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})              (= 0 if c(w_{i-1}, w_i) = 0)
P(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})              (= 0 if c(w_{i-2}, w_{i-1}, w_i) = 0)
• These relative frequency estimates are the MLE (maximum likelihood estimates):
choice of parameters that give the highest probability to the training corpus
Sparsity
• The larger the number of parameters, the more likely it is to get 0 probabilities
P(w_1, w_2, ..., w_N) ≈ ∏_i P(w_i | w_{i-n} ... w_{i-1})
• A single unseen n-gram zeroes one factor, and hence the probability of the whole sentence