
General course info:

 To get help: email teaching.assistants.NLP@gmail.com

Lecture 1:

Children learn their first language automatically and without effort during the critical period (ages 2-7) = First language acquisition (it is believed that this is because infants have a language acquisition device which enables them to acquire and produce language).
Learning a language at a later stage of life is by choice and takes more effort (= Second language acquisition)

Important tools:
 finite-state automaton (FSA, a five-tuple): can be used to represent a language; can be deterministic (always gives the same result for the same input) or non-deterministic
 regular languages (class of languages):
o a language is regular only if some FSA accepts it
o A* (= Kleene star) means zero or more concatenations of A: A* = A^0 ∪ A^1 ∪ A^2 ∪ … (where A^i is A concatenated i times)
o the complement of a regular language is a regular language (same for the intersection of 2 regular languages)
o English is not regular
 regular expressions (search patterns): represent regular languages or finite-state automata
o i.e. a regex: a sequence of characters that defines a search pattern
o python regex syntax:

* some shortcuts:

* \1 refers back to the first captured group, i.e. it checks whether the same substring identified earlier occurs again in the string (see the regex sketch after this list)

there are automata for word lists -> can be combined (together they form a language)
 we don't want proper names (= names of persons, locations, organisations, etc.) in those lists, because the lists would go on forever -> proper names need to be identified separately. Words starting with a capital letter are names, or phrases consisting of such names, possibly with words like van/der in between
o However, input texts often need to be pre-processed (i.e. removing capital characters), so how do we identify them then? And capitalization alone still doesn't recognise all names.
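A minimal sketch of these regex ideas in Python's re module; the patterns and example strings are invented for illustration:

import re

# Naive proper-name heuristic: a capitalized word, optionally followed by
# more capitalized words or infixes like van/der/de
name_pattern = re.compile(r"[A-Z][a-z]+(?:\s+(?:van|der|de|[A-Z][a-z]+))*")
print(name_pattern.findall("Yesterday Anna van der Berg visited Amsterdam."))
# -> ['Yesterday Anna van der Berg', 'Amsterdam']
#    the sentence-initial capital drags 'Yesterday' in: exactly the caveat above

# \1 is a backreference: it matches the same text as the first captured group
doubled = re.compile(r"\b(\w+) \1\b")
print(doubled.search("this is is a typo").group())   # 'is is'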
Lecture 2:

Finite state transducers (FST): (six-tuple)


 Produces an output string for a given input string (so it can transform strings): a transition label a:b means 'if you read a, write b' (see the picture on the slides). Written down as 'the transducer accepts {<input, output>, …}'
* the conversion can be context sensitive
 Can also be used for testing whether strings belong to a language (classification): i.e. can stringA be produced by this FST? If yes -> it belongs to the language
 FSTs are nondeterministic (for every FSA there is an equivalent deterministic one; that is not always the case for FSTs)
 Weighted finite-state transducers: FSTs whose transitions have costs/probabilities. Find the minimum-cost path -> set the weights such that only the minimum-cost sentence is the correct sentence
o With the Viterbi algorithm: follow all possible paths from the initial state -> when 2 paths end up in the same state (not only the final state) -> keep only the lowest-cost one (prune the rest) -> continue until the end is reached -> this yields the cheapest path (a small code sketch follows below)
* if 2 paths have the same cost -> keep them both
Advantage: you don't need to check all paths individually (uses decomposition)
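A small sketch of the Viterbi pruning idea on a toy weighted automaton: at every step only the cheapest way to reach each state is kept. All states, symbols and costs are invented for illustration:

# costs[state][symbol] = list of (next_state, cost); toy numbers only
costs = {
    "q0": {"a": [("q1", 1.0), ("q2", 3.0)]},
    "q1": {"b": [("q3", 2.0)]},
    "q2": {"b": [("q3", 0.5)]},
    "q3": {},
}

def viterbi_min_cost(start, string):
    best = {start: 0.0}                      # cheapest known cost per state
    for symbol in string:
        new_best = {}
        for state, cost in best.items():
            for nxt, step_cost in costs[state].get(symbol, []):
                # keep only the cheapest path into each state (prune the rest)
                if nxt not in new_best or cost + step_cost < new_best[nxt]:
                    new_best[nxt] = cost + step_cost
        best = new_best
    return best

print(viterbi_min_cost("q0", "ab"))          # {'q3': 3.0}, via q0 -a-> q1 -b-> q3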

Tokenization / segmentation:
Turn something (a string/text) into tokens (= smaller chunks; usually words, but they can also be characters or sentences)
 Mostly done as pre-processing for more complex applications
 Rules can be created with regexes and (mostly weighted) FSTs
 Often based on punctuation (i.e. '?', '!', '.', ' '), but this does not always work (i.e. New York will be split, or 'don't' -> 'don', 't'); see the sketch below
 Hard for languages written without spaces between words (i.e. Chinese) -> use weighted FSTs
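A rough illustration of the punctuation problem with a purely regex-based tokenizer (the pattern is a deliberate oversimplification, not the course's rules):

import re

text = "Don't forget: Mr. Smith lives in New York."
# Naive tokenization: runs of word characters, or single punctuation marks
print(re.findall(r"\w+|[^\w\s]", text))
# -> ['Don', "'", 't', 'forget', ':', 'Mr', '.', 'Smith', ...]
#    "Don't" and "Mr." fall apart, and "New York" still ends up as two tokens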

Inflectional morphology: (morphology = how words are built up from smaller parts) (inflection creates new forms of the same word)
 Morphemes = the smallest units in a language that have some meaning and cannot be divided into smaller units without changing that meaning
 Words can change form regularly or irregularly (small -> smaller vs good -> better); regular changes are identified by productive patterns (= rules, which are more efficient than databases)
 Morpheme types:
o Free morphemes: indivisible words (i.e. stop, fox, eat)
o Bound morphemes: cannot be words by themselves (i.e. affixes: s, ing, un, able)
* different from syllables
 Hapaxes = words that occur only once in a text corpus. They are the evidence for morphological rules: the meaning of such words can be computed from the meaning of their morphological constituents.
Rank of a word: the position of the word when all permutations of its letters are written in alphabetical order. (So for string A: change the order of its letters to get all different letter sequences -> write all resulting (mostly non-existent) 'words' in alphabetical order -> the position of string A in that order is its rank; a brute-force sketch follows below)
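A brute-force sketch of this 'rank of a word' definition; fine for short strings, but listing all permutations explodes for longer ones:

from itertools import permutations

def rank(word):
    # all distinct letter orderings, written in alphabetical order
    orderings = sorted(set("".join(p) for p in permutations(word)))
    return orderings.index(word) + 1          # 1-based position of the word itself

print(rank("bca"))   # orderings: abc, acb, bac, bca, cab, cba -> rank 4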

Stemming:
Break down words into a stem (the part of the word that is common to all variations) and the rest (smaller = small + rest)
 Algorithm: the Porter algorithm: a series of rewrite rules implemented as FSTs to get the stem of the word (a usage sketch follows below)
* is lexicon-free (doesn't need a dictionary, it just slices words based on rules)
* but only works for regular words
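For reference, NLTK ships an implementation of the Porter algorithm; a quick usage sketch (assumes nltk is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("caresses"))   # 'caress'  (rule-based suffix stripping)
print(stemmer.stem("ponies"))     # 'poni'    (the stem need not be a real word)
print(stemmer.stem("went"))       # 'went'    (irregular forms are left untouched)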

Lemmatization:
The lemma represents all the forms / inflections of a word
 A lemma is a word itself; a stem doesn't have to be (i.e. go -> represents goes, went, going)
 Mapping a word to its base form = lemmatization
 Uses a database -> handles irregular words too, but sometimes mistakes verbs for nouns and gives the wrong output -> such cases can be fixed, but only via hardcoding (see the sketch below)
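A quick sketch with NLTK's WordNet lemmatizer (assumes nltk and its wordnet data are available), showing both the database lookup and the 'thinks verbs are nouns' problem:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)           # the lemma database
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("went", pos="v"))   # 'go'   (irregular verb, found in the database)
print(lemmatizer.lemmatize("went"))            # 'went' (default POS is noun -> wrong lemma)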

Evaluation of NLP component:


 Intrinsic evaluations: measure performance on its defined subtask (i.e. accuracy), usually
against a defined standard in a reproducible setting
 Extrinsic evaluations: measure the component's contribution to the performance of a complete application (i.e. performance on a different, downstream problem)

Derivational morphology: (creates total new words)


 In Agglutinative languages: most words are formed by joining morphemes together (i.e.
Japanese)

Morphemes word creating types:


 Affixes: bound morphemes that can be combined with a stem to form new words (i.e. suffix, prefix, infix (i.e. opblazen -> opgeblazen), circumfix (i.e. gooien -> gegooid)) (English only has suffixes and prefixes)
 Reduplication: repeat a morpheme (i.e. orang=person, orang-orang=people) (not in dutch or
English)
 Blending: combine parts of morphemes (i.e. fog+smoke = smog)
 Acronyms: create new word from initial letters of others (abbreviations)
 Clipping: remove parts of words (doc (doctor), gym (gymnastics))
 Back-formation: remove affixes or quasi-morphemes (editor-> edit, burglar-> burgle)

Other morpheme classes:


 Allomorphs: different phonetic forms or variations of a morpheme, same meaning (i.e.
plural)
 Homonyms: spelled and pronounced the same but have different meaning (meaning can
only be obtained by context) (i.e. bank)
 Homophones: sound same, but have different meanings and spellings (i.e. plain, plane)

Inflection vs derivation:
 In inflection you don't change the core meaning of the word (the word remains within the same lexeme (= group of word forms with the same base and meaning)); only grammatical information such as number or tense is expressed, and the word class stays the same
 Derivation changes the meaning and the lexeme (and often the syntactic class); the grammatical profile changes (i.e. a different word class)
o Need Morphological parser: divides words into their morphemes. Consists of:
 Lexicon (=dictionary): with stems and affixes
 Morphotactic rules: specify how morphemes can be combined (i.e. where
should affix be placed relative to stem)
 Orthographic rules: specify the required spelling changes after combining
morphemes (i.e. cherry + PL = cherries)
* often are FSTs, hard to make
 Examples:
Stem    | Derived word | Inflected form
work    | worker       | works
red     | redness      | red (i.e. red car)
friend  | friendship   | friends
Lecture 3:

Word similarity: (syntax)


 When a word is e.g. misspelled -> how do we know what the correct word probably is? -> requires orthographic similarity
 Transform any string into another
o Basic operations / edits:
 Deletion (of a letter)
 Insertion
 Substitution
o Minimum number of 'edits' required to change one word into another = minimum edit distance / Levenshtein distance
 How to always get the optimal result / Levenshtein distance (algorithm):
1. Use a distance matrix: fill in the minimum cost (in basic operations) of making the target word (so far) from the source word (so far). If the 2 letters are the same -> substitution has cost 0.
2. When filled in -> the optimal edit path = trace back from bottom right to top left, always following the lowest number among the surrounding cells (least costly; not necessarily unique)
* see the algorithm equations on slide 26 (for more clarity); a code sketch follows below
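A minimal dynamic-programming sketch of the distance-matrix algorithm above, with unit costs for insertion, deletion and substitution (substitution is free when the letters match; some courses charge 2 for a substitution instead):

def levenshtein(source, target):
    m, n = len(source), len(target)
    # dist[i][j] = minimum edits to turn source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                               # delete everything
    for j in range(n + 1):
        dist[0][j] = j                               # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[m][n]

print(levenshtein("intention", "execution"))   # 5 with these unit costs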

Language models (model that predicts the next word from previous words):
 Task of assigning probabilities to the sentences of a language-> get word probabilities
 Models can also be defined on characters or sounds instead of words
 A chunk of N consecutive words is called an N-gram (unigram, bigram, trigram, 4-gram, etc.); N-grams can be used to estimate the probability of a sentence -> calculated by:
o Bigram approximation: P(w_1 … w_n) ≈ ∏_(k=1..n) P(w_k | w_(k-1))
(based on the Markov assumption: the probability of word w_k only depends on the previous word w_(k-1))
 i.e. P(do|I) = 0.4 , P(them|like) = 0.5 , P(not|do) = 1 , P(like|not) = 1
What is the probability of the sentence “I do not like them” relative to this
corpus?
P(I do not like them) = P(do|I) • P(not|do) • P(like|not) • P(them|like)
= 0.4 • 1 • 1 • 0.5 = 0.2

o Generalized to other N-gram sizes: P(w_1 … w_n) ≈ ∏_k P(w_k | w_(k-N+1) … w_(k-1)); in the unigram case this reduces to the product of the word probabilities:
 i.e. P(I do not like them) = P(I) • P(do) • P(not) • P(like) • P(them)
= 5/15 • 2/15 • 2/15 • 2/15 • 1/15 ≈ 0.0000527
* the resulting probability depends on the N-gram order used (compare with the bigram estimate above)
o Perplexity is the inverse probability of the test set, normalized by the number of words: PP(W) = P(w_1 … w_N)^(-1/N)

* Minimizing perplexity = maximizing the probability of the test set

* Some probabilities can be very small -> use log-space: log P(w_1 … w_n) = Σ_k log P(w_k | w_(k-1))
 General procedure for comparing unigram, bigram and trigram models: train each model on the same training data and compare their perplexities on the same held-out test set
 Sparseness problem: N-grams that never occur in the training data get P = 0; deal with this by:

o In the training set (of fixed vocabulary), replace all words that occur only once by the token 'unknown' -> estimate the model. Map unknown words in the test set to 'unknown' as well -> compute probabilities
* but this only works for a large training corpus and with unigrams
o Laplace smoothing / Laplace estimate: assign some non-zero P to all N-grams that never occur -> add one to the count of every word when calculating its probability (see the code sketch after this list):
P_Laplace(w_k) = (c(w_k) + 1) / (N + V)
c(w_k) = count of w_k
N = number of tokens in the corpus
V = vocabulary size (number of distinct words)
o Good-Turing smoothing: use the frequency of N-grams that occur only once and the total number of tokens to estimate the probability of N-grams that never occur (our zero-probability N-grams, slide 44):
N_c = number of N-grams that occur exactly c times
P(unseen N-gram) ≈ N_1 / N (N = total number of tokens)
for seen N-grams, use the adjusted count c* = (c + 1) · N_(c+1) / N_c and then P = c* / N
o Backoff smoothing: when an N-gram has zero count, back off to the (N-1)-gram (use less context)

o Interpolation smoothing: mix all N-gram orders, e.g. P̂(w_k | w_(k-2) w_(k-1)) = λ_1 P(w_k) + λ_2 P(w_k | w_(k-1)) + λ_3 P(w_k | w_(k-2) w_(k-1)), with Σ λ_i = 1
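A small sketch tying the bigram model, Laplace smoothing and perplexity together on a tiny invented corpus (<s> and </s> mark sentence boundaries; the corpus and numbers are made up):

import math
from collections import Counter

corpus = [s.split() for s in ("<s> I do not like them </s>",
                              "<s> I like them </s>")]
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(b for sent in corpus for b in zip(sent, sent[1:]))
V = len(unigrams)                                  # vocabulary size

def p_laplace(prev, word):
    # add-one smoothing: (c(prev, word) + 1) / (c(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def perplexity(sentence):
    words = sentence.split()
    logp = sum(math.log(p_laplace(a, b)) for a, b in zip(words, words[1:]))
    return math.exp(-logp / (len(words) - 1))      # inverse probability per word

print(perplexity("<s> I do not like them </s>"))
print(perplexity("<s> them like not do I </s>"))   # shuffled -> higher perplexity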

* Exam material: On Chomsky paper


Lecture 4:

Word classes / parts-of-speech (POS)


 Noun:
o Proper nouns: names (as mentioned before), usually starting with capital character
o Common nouns: abstract or concrete things (car, mouse, etc.)
 Count nouns: can be in plural (i.e. mice)
 Mass nouns: cannot be counted (i.e. salt)
 Verb: represent actions and processes
o Auxiliary verbs: used in combo with other verbs (be, have)
 Adjectives: describe properties of nouns (blue, tall)
 Adverbs: describe properties of non-nouns, typically verbs (time, direction, manner) (i.e. very, slowly)
 Preposition: describe relationships between nouns or noun and verb. (i.e. under, after)
 Determiners: noun modifiers which indicate reference (mostly to a noun) (the, a , an, this)
 Pronouns: refer to nouns
o Personal: standard (i.e. I, they, it)
o Possessive: indicate possession (my, your, their)
o Wh-pronouns: refer to unknown entities in questions (i.e. who, what, which)
 Particles: component part of a phrasal verb (i.e. get off, turn down)
 Conjunctions: words that join phrases together (and, however, etc)
 Other classes:
o Participles: (=deelwoord) verb in certain tense, can be present or past (i.e. doing,
done)
o Interjections: meaningless words (i.e. oh, uhm)
o Possessive ending: the ending with ‘s (i.e. John’s car)
o Foreign words: (=leenwoorden) like latin expressions in English
o Numerals (i.e. one, two, second)
o Punctuation signs
 Two types of tag classes:
o Closed class: the members are fixed (prepositions, determiners, pronouns)
o Open class: those that are never fully defined (nouns, verbs, adjectives and adverbs); new words are added continually (i.e. emoticon = emotion + icon, malware = malicious + software)

English tagged corpora exist: corpora that have been tagged with parts-of-speech and checked
manually, can be
 Balanced: contain carefully selected sections of text from different domains
 parsed: also offers info about syntactic relations between words

POS tagging problem: words can have more than one POS (one word can get different tags depending on the context) -> which tag should it be?
 The annotation guide (book with tagging guidelines) helps. For example, adjective vs noun: substitute a more common adjective or noun -> the one that still makes sense indicates the correct class
 Most words are very easy, but some words are hard (i.e. 'still' can belong to 7 classes); however, some tags are a lot more likely than others (-> probability)
 With a very (much too) simple tagging algorithm the accuracy is already 90% -> the baseline is already pretty high -> a serious tagger really needs to get close to 100%
Part-of-speech tagging is useful in, i.e.:
 Speech recognition: POS of words tell what words are likely to occur around it
 Speech synthesis: tag tells something about pronunciation
 Morphological processing: tells what affixes it can take
 Parsing: for keeping lexicon out of parse rules (we want to keep rules as small as possible)
 Corpus linguistics: useful for finding instances of particular syntactic constructions (i.e. when
trying to find certain words with regex)

Part-of-speech tagging approaches:


 Lexically-based: just assigns most frequently occurring tag (in training corpus)
 Rule-based: error-driven learning / transformation-based learning (TBL):
1. Assign the most frequent tag to each known word
2. Guess the best tag for unknown words based on their morphemes
3. Create a rule for correcting tag errors in the data
4. Apply the new rule and repeat steps 3 and 4 until no further improvement
* requires language-specific knowledge
 Probabilistic (with hidden Markov models): assign tags based on the probability of a tag sequence occurring.
Uses Bayes' rule, but the denominator of Bayes' rule (P(W)) can be safely ignored -> maximize the two probabilities in the numerator, so find the tag sequence that maximizes
P(T|W) ∝ P(T) · P(W|T) = ∏_i P(t_i | t_(i-1)) · P(w_i | t_i)
(a toy sketch follows after this list)
 Deep learning: kind of probabilistic, with neural networks
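A toy sketch of that product ∏ P(t_i | t_(i-1)) · P(w_i | t_i); all transition and emission probabilities below are invented for illustration:

# Invented transition and emission probabilities for a two-word toy example
trans = {("<s>", "PRON"): 0.4, ("PRON", "VERB"): 0.5, ("PRON", "NOUN"): 0.2}
emit  = {("PRON", "they"): 0.3, ("VERB", "run"): 0.2, ("NOUN", "run"): 0.1}

def tag_sequence_score(words, tags):
    score, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        score *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return score

words = ["they", "run"]
print(tag_sequence_score(words, ["PRON", "VERB"]))   # 0.4*0.3*0.5*0.2 = 0.012
print(tag_sequence_score(words, ["PRON", "NOUN"]))   # 0.4*0.3*0.2*0.1 = 0.0024 -> less likely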

Evaluate tagging: (Multinomial classification problem / multiclass problem)


 By raw accuracy = #correctly tagged tokens / #total tokens

 By K-score (kappa) = (accuracy − random accuracy) / (1 − random accuracy), output range = [0, 1]

Random accuracy is obtained through permutation of tags you want to evaluate

 Binary classification: each training item is true/false or class A/B, etc


 Confusion matrix (matrix with TP, FP, FN and TN values)
 Accuracy = (TP + TN) / (TP + TN + FP + FN)
Lecture 5:

Performance measures (of i.e. a classifier):


 Precision: what proportion of predicted positives is truly positive (how valid are the results?): Precision = TP / (TP + FP)

 Recall: what proportion of actual positives is recognized as positive (how complete are the results?): Recall = TP / (TP + FN)

 F_β-measure: combines both (because which of the two matters more depends on the task): F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

If β is higher (i.e. 2) -> emphasizes recall; if lower (i.e. 0.5) -> emphasizes precision (usage examples: see slide 9)
* For multiple classes -> calculate the weighted average over the per-class scores (see the example on slide 11); a code sketch follows below
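A small sketch of these measures computed from raw confusion-matrix counts (the counts are invented placeholders):

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # beta > 1 emphasizes recall, beta < 1 emphasizes precision
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p, r = precision(tp=8, fp=2), recall(tp=8, fn=4)     # invented counts
print(p, r, f_beta(p, r), f_beta(p, r, beta=2))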

N-grams cannot deal with:


 Syntax: sentence probability is not the same as syntactic well-formedness (with a unigram model, a shuffled sentence even gets exactly the same probability)
 Long-distance dependencies: in 'the pizza with all these tasty toppings was/were delicious' the model will prefer 'were' (agreeing with the nearby 'toppings') instead of 'was' (agreeing with 'pizza')
 Hierarchical structure: ‘big brown dog barks’ will give (big brown) (dog barks) but should be
(big (brown dog) barks)

Syntactic analysis / parsing :


 Syntactic structure gives a clue to meaning of sentence.
 Phrase structure organizes words into larger units
o Words have parts-of-speech
o Combined into phrases-> phrasal categories
o Combined to bigger phrases-> nested (recursion)
o Gives order to words and gives hierarchical structure
 Syntactic parsing: the task of inferring the structure that the rules of the grammar assign to a sentence, e.g. using a probabilistic context-free grammar; but parsing this way naively amounts to exhaustive search.

Context-free grammar: consists of (a small example follows after this list)


 Non-terminal symbols
 Terminal symbols
 Rules
 Start symbol
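A tiny context-free grammar written in NLTK's notation, showing all four ingredients (start symbol S, non-terminals such as NP and VP, terminals such as 'dog', and rules); the grammar itself is invented and far too small to be realistic:

import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N | Det Adj N
    VP -> V | V NP
    Det -> 'the'
    Adj -> 'brown'
    N  -> 'dog' | 'cat'
    V  -> 'barks' | 'sees'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the brown dog barks".split()):
    tree.pretty_print()                # prints the phrase-structure tree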

Dependency grammar: each word depends on another word; the root is the only word not dominated by another word. Represented with labelled arrows pointing from the (dependent) subordinate towards its superior (head).
Lecture 6:

Ambiguity: several parses of a sentence can be grammatically correct, but the sentence is interpreted differently in each -> need preferences for choosing an interpretation
 Transition-based dependency parsing: start from an initial configuration -> a classifier predicts the parser action needed to move to the next configuration -> repeat until a terminal configuration is reached (= the root node of the tree is the only element left on the stack)
o Parser configuration consists of:
 Buffer: all words that still need to be processed
 Stack: all words that are currently being processed
 Partial dependency tree: some words have already been put into the tree (the dependency arcs created so far)
o Parser actions:
 Shift transition (SHIFT): removes the front word from the buffer -> adds it to the top of the stack
 Left-arc transition (LARC): creates a dependency with the top-most word of the stack as head of the second top-most word, and removes the second top-most word
 Right-arc transition (RARC): creates a dependency with the second top-most word of the stack as head of the top-most word, and removes the top-most word
* a toy sketch of these actions follows after this list
 Train dependency parser with neural network for preferences.
 Measures for evaluating dependency parsers:
o Exact match score (EM): #sentences with a completely correct dependency tree / #sentences
o Unlabelled attachment score (UAS): #words with the correct head / #words
o Labelled attachment score (LAS): #words with both the correct head and the correct arc label / #words
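A highly simplified sketch of the SHIFT / LARC / RARC actions, executing a hand-written (oracle) action sequence on one toy sentence; a real parser would let a trained classifier choose each action:

buffer = ["she", "eats", "fish"]
stack = ["ROOT"]
arcs = []                     # (head, dependent) pairs: the partial dependency tree

def shift():                  # move the front word of the buffer onto the stack
    stack.append(buffer.pop(0))

def left_arc():               # top of stack becomes head of the word below it
    dependent = stack.pop(-2)
    arcs.append((stack[-1], dependent))

def right_arc():              # word below the top of stack becomes its head
    dependent = stack.pop()
    arcs.append((stack[-1], dependent))

for action in (shift, shift, left_arc, shift, right_arc, right_arc):
    action()

print(arcs)    # [('eats', 'she'), ('eats', 'fish'), ('ROOT', 'eats')]
print(stack)   # ['ROOT'] -> terminal configuration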

Meaning = idea that is represented by a sentence / word / etc. (dictionary)


 Linguistic definition: a signifier (symbol) that signifies a thing / idea / concept
 Denotational semantics: formalize meaning in a mathematical language to describe the meanings of expressions (originally used for programming languages)
 Feature theory: meaning is a list of features of an object (i.e. chair has legs, and backrest, is
furniture)
 Prototype theory: any given concept in any given language has a real world example that
best represents this concept
o Concepts are graded and can overlap

Wordnet stores words with meaning and semantic relation with other words (organized
hierarchically)
 has synonyms and hypernyms (=category that includes other words, i.e. color)
 Misses new meanings of words (needs to be updated continually), and word similarity cannot easily be computed from it ->
o Could represent words as discrete symbols (each word has a unique encoding (= localist representation)), like a binary code in a space; words are then represented as one-hot vectors (all 0's and a single 1) -> but then every word is maximally dissimilar to every other word…
o Could use WordNet's hypernym hierarchy to obtain a better measure of semantic similarity (but it doesn't give intuitive results, and the vectors need far too many 0's)
o -> encode word similarity directly into dense real-valued vectors -> compare them with cosine similarity (good because it looks at direction (the angle ϴ), not magnitude); see the sketch below
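A small numpy sketch contrasting one-hot vectors (every distinct pair is maximally dissimilar) with dense vectors compared by cosine similarity; the dense numbers are made up:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors: any two different words have similarity 0
hotel, motel = np.eye(5)[0], np.eye(5)[1]
print(cosine(hotel, motel))            # 0.0

# Dense (made-up) embeddings: similar words can point in a similar direction
hotel_d = np.array([0.8, 0.1, 0.3])
motel_d = np.array([0.7, 0.2, 0.4])
print(cosine(hotel_d, motel_d))        # close to 1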

Representing word meaning by context:


 Distributional semantics: (For words instead of sentences.) Meaning of word is given by total
set of words it can be used with
o Context = set of words that appear near word w (in a fixed-size window) -> use
many contexts of w to build representation of w itself.
Lecture 7:

Word vectors = word embeddings = word representations: vector representation of a word


 Want dense vectors -> similarity in context can then be compared (a bit like a distribution over all words where the nearest neighbours are the most similar, e.g. number words like 'one', 'two' end up close together)
 ‘Word2vec’: a framework for learning word embeddings (a usage sketch with an off-the-shelf implementation follows after this list).
1. all words (in large corpus) are represented as vector of random real numbers
2. go through all occurrences of the word w in a text, where w is the ‘center word’
3. use representation of w to predict context words by calculating 𝑃(𝑐|𝑤)
4. adjust vector to maximize all probs
o task is self-supervised.
o Skip-gram model: go from a center word to most probable context words (with
training, the probabilities of all context words to center words are set, depending on
#occurrences.)
 Softmax function: maps a vector of scores to a probability distribution. (The randomly initialized word vectors are combined in the hidden layer (of the network), and the resulting scores are mapped to output probabilities by the softmax function)

o Move from the probability to the average negative log likelihood as the objective function: J(θ) = −(1/T) Σ_t Σ_(−m ≤ j ≤ m, j ≠ 0) log P(w_(t+j) | w_t)

 Minimizing this function -> maximizes the probabilities in word2vec

 Gradient descent: iterative algorithm for finding local min (of a differentiable function)
o Multivariate gradient at a point (so for multiple parameters): the gradient is itself a vector (consisting of the partial derivatives)

 Output error = log loss (cross-entropy): L = −Σ_i y_i · log(ŷ_i)

o Take the partial derivative of the loss with respect to each parameter (so how much every parameter influences the output error): see slides 26-29 (not very important(?))
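For experimenting with these ideas, the gensim library provides a word2vec implementation; a minimal usage sketch on a toy corpus (parameter names follow gensim 4.x; real training needs a corpus of millions of tokens, so the vectors here are meaningless):

from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "country"],
             ["the", "queen", "rules", "the", "country"],
             ["dogs", "bark", "loudly"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
print(model.wv["king"][:5])                   # dense real-valued vector
print(model.wv.similarity("king", "queen"))   # cosine similarity of the two embeddings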

Test vectors by comparing with your intuition.


Some problems with word embeddings:
 Models (like word2Vec) do not capture polysemy: different meanings to the same word
 Meaning does not actually have an algebraic structure.
 It captures the meaning of the contexts more than the meaning of the word itself

