Lecture 1:
Children learn their first language during the critical period (ages 2-7) automatically and without any effort = first
language acquisition (it is believed this is because infants have a language acquisition device which
enables them to acquire and produce language).
Learning a language at a later stage of life is by choice and takes more effort (= second language
acquisition)
Important tools:
finite-state automaton (FSA, a five-tuple) can be used to represent a language; can be
deterministic (the same input always gives the same result) or non-deterministic (see the sketch at the end of this lecture's notes)
regular languages (class of languages):
o a language is only regular if some FSA accepts it
o A* (= Kleene star) means zero or more concatenations of A: {ε} ∪ A ∪ AA ∪ AAA ∪ …
o Complement of a regular language is a regular language (same with the intersection of 2
regular languages)
o English is not regular
regular expressions (search patterns): represent regular languages / finite-state automata
o i.e. a regex: a sequence of characters that defines a search pattern
o python regex syntax offers some shortcuts (e.g. \d for digits, \w for word characters); see the sketch below
there are automata for word lists -> can be used (together they form a language)
don't want proper names (= names of persons, locations, organisations, etc.) in those lists
because they would go on forever -> names need to be identified: words starting with a capital
letter, or phrases consisting of such words, or phrases with van/der etc.
o However, input texts often need to be pre-processed (e.g. removing capital characters), so
how to identify names then? And the capitalisation heuristic still doesn't recognise all names (see the sketch below).
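A minimal Python sketch of these tools, under my own assumptions: a tiny deterministic FSA written as a transition table, a regex using common shortcuts, and the capital-letter heuristic for names (including van/der). The example strings and patterns are illustrations, not from the lecture.

```python
import re

# A tiny deterministic FSA (five-tuple: states, alphabet, transitions, start, accepting)
# that accepts strings of the form a*b+ over the alphabet {a, b}.
start, accepting = "q0", {"q1"}
delta = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "b"): "q1",
}

def accepts(s):
    state = start
    for ch in s:
        state = delta.get((state, ch), "reject")  # missing transition -> reject
    return state in accepting

print(accepts("aaab"))   # True
print(accepts("ab a"))   # False

# Python regex with common shortcuts: \d = digit, + = one or more.
print(re.findall(r"\d+", "room 101, floor 3"))   # ['101', '3']

# Capitalisation heuristic for proper names: capitalised words, optionally joined
# by more capitalised words or van/der. It over- and under-generates, as noted above.
name_pattern = r"(?:[A-Z][a-z]+)(?:\s+(?:van|der|[A-Z][a-z]+))*"
print(re.findall(name_pattern, "Vincent van Gogh was born in Zundert."))
# ['Vincent van Gogh', 'Zundert']
```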
Lecture 2:
Tokenization / segmentation:
Turn something (string/text) into tokens (= smaller chunks) (tokens can also be just characters or words)
Mostly done as pre-processing for more complex applications
Rules can be created with regexes and FSTs (mostly weighted)
Often based on punctuation and whitespace (e.g. '?', '!', '.', ' '), but this does not always work (e.g. 'New York'
will be split, or 'don't' -> 'don', 't'); see the sketch below
Hard for non-alphabetic languages (e.g. Chinese, where words are not separated by spaces) -> use weighted FSTs
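A minimal sketch of punctuation/whitespace-based tokenization in Python, showing the problems mentioned above ('don't' falls apart, 'New York' is split); the regexes are my own illustrations.

```python
import re

text = "I don't live in New York!"

# Naive tokenizer: keep only runs of letters/digits.
naive_tokens = re.findall(r"[A-Za-z0-9]+", text)
print(naive_tokens)
# ['I', 'don', 't', 'live', 'in', 'New', 'York']  -> "don't" falls apart.

# Slightly better rule: keep an internal apostrophe with the word, keep punctuation as tokens.
better_tokens = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)
print(better_tokens)
# ['I', "don't", 'live', 'in', 'New', 'York', '!']  -> 'New York' is still two tokens;
# multiword names need extra machinery (e.g. a weighted FST or an entity lexicon).
```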
Inflectional morphology: (morphology = how words are built up from smaller parts; inflection creates
new forms of the same word)
Morphemes= smallest units in a language that have some meaning, and cannot be divided
into smaller units without changing the meaning
Words can change form regularly or irregularly (small -> smaller vs good -> better); regular changes are
captured by productive patterns (= rules, which are more efficient than storing every form in a database)
Morpheme types:
o Free morphemes: indivisible words (i.e. stop, fox, eat)
o Bound morphemes: cannot be words by themselves (i.e. affixes: s, ing, un, able)
* different from syllables
Hapaxes = words that only occur once in a text corpus. They are evidence for morphological
rules: the meaning of such words can be computed from the meaning of their morphological constituents.
Rank of a word: the position of the word when all permutations of its letters are written in alphabetical order.
(So for string A: rearrange the letters to get all different letter sequences -> write all resulting
(possibly non-existent) 'words' in alphabetical order -> the position of string A in that order is the
rank; see the sketch below)
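A brute-force sketch of this "rank" idea; the word 'cab' is my own example (this enumeration only scales to short words).

```python
from itertools import permutations

def rank(word):
    # All distinct letter sequences, sorted alphabetically.
    perms = sorted(set("".join(p) for p in permutations(word)))
    # 1-based position of the original word in that ordering.
    return perms.index(word) + 1

print(rank("cab"))  # sorted permutations: abc, acb, bac, bca, cab, cba -> rank 5
```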
Stemming:
break down words into a stem (= part of the word that is common to all variations) and a rest (smaller =
small + rest)
Algorithm: Porter algorithm: a series of rewrite rules, implemented as an FST, to get the stem of
the word.
* is lexicon-free (doesn't need a dictionary, just slices words based on the rules)
* but only works for regular words (see the sketch below)
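A minimal sketch using NLTK's implementation of the Porter algorithm (assumes the nltk package is installed); the word list is my own.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "cats", "smaller", "went"]:
    print(word, "->", stemmer.stem(word))
# Regular forms are sliced down by the rewrite rules (e.g. 'running' -> 'run', 'cats' -> 'cat'),
# but an irregular form like 'went' stays as it is: no rule maps it back to 'go'.
```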
Lemmatization:
The lemma represents all the forms / inflections of a word
A lemma is a word itself; a stem doesn't have to be (e.g. go -> represents goes, went, going)
Mapping a word to its base form = lemmatization.
Uses a database -> works for irregular words, but sometimes treats verbs as nouns (instead of
verbs) and gives wrong output -> this can be fixed, but only via hardcoding (see the sketch below)
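A minimal sketch using NLTK's WordNet-based lemmatizer (assumes nltk and its WordNet data are installed), including the verb-treated-as-noun problem mentioned above.

```python
from nltk.stem import WordNetLemmatizer  # may require nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("went", pos="v"))   # 'go'   -- the database handles irregular verbs
print(lemmatizer.lemmatize("went"))            # 'went' -- default POS is noun, so nothing changes
print(lemmatizer.lemmatize("goes", pos="v"))   # 'go'
```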
Inflection vs derivation:
In inflection you don't change the meaning of the word (the word remains within the same lexeme
(= group of word forms with the same base and meaning) / overall meaning), and the grammatical
class stays the same
Derivation changes the meaning and the lexeme (and often the syntactic class); the grammatical
info is changed (i.e. a different word class)
o Need Morphological parser: divides words into their morphemes. Consists of:
Lexicon (=dictionary): with stems and affixes
Morphotactic rules: specify how morphemes can be combined (i.e. where
should affix be placed relative to stem)
Orthographic rules: specify the required spelling changes after combining
morphemes (i.e. cherry + PL = cherries)
* often are FSTs, hard to make
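A toy sketch of the three components (my own simplification; the lecture's parsers are FSTs, and this goes in the generation direction rather than parsing): a tiny lexicon, one morphotactic constraint (the plural suffix attaches to noun stems), and orthographic rules for the spelling changes (cherry + PL = cherries).

```python
# Toy morphological generator: lexicon + morphotactics + orthography.
lexicon = {"cherry": "noun", "fox": "noun", "friend": "noun"}

def pluralize(stem):
    # Morphotactic rule: the PL suffix may only follow a noun stem.
    if lexicon.get(stem) != "noun":
        raise ValueError("PL attaches only to noun stems")
    # Orthographic rules: spelling changes when combining stem + suffix.
    if stem.endswith("y"):
        return stem[:-1] + "ies"      # cherry + PL -> cherries
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"            # fox + PL -> foxes
    return stem + "s"                 # friend + PL -> friends

for w in ["cherry", "fox", "friend"]:
    print(w, "+ PL ->", pluralize(w))
```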
Examples:
Stem     Derived word   Inflected form
Work     Worker         Works
Red      Redness        Red (e.g. red car)
Friend   Friendship     Friends
Lecture 3:
Language models (model that predicts the next word from previous words):
Task of assigning probabilities to the sentences of a language-> get word probabilities
Models can also be defined on characters or sounds instead of words
A chunk of N consecutive words is called an N-gram (unigram, bigram, trigram, 4-gram etc.);
N-grams can be used to estimate the probability of a sentence -> calculated by:
o Bigram approximation: P(w1 … wn) ≈ P(w1) • P(w2|w1) • … • P(wn|wn-1), i.e. the product of the
probabilities of the bigrams occurring
(based on the Markov assumption: the probability of word wk only
depends on the previous word wk-1)
e.g. P(do|I) = 0.4 , P(them|like) = 0.5 , P(not|do) = 1 , P(like|not) = 1
What is the probability of the sentence "I do not like them" relative to this
corpus?
P(I do not like them) ≈ P(do|I) • P(not|do) • P(like|not) • P(them|like)
= 0.4 • 1 • 1 • 0.5 = 0.2
Compare the unigram approximation: P(I do not like them) ≈ P(I) • P(do) • P(not) • P(like) • P(them)
= 5/15 • 2/15 • 2/15 • 2/15 • 1/15 ≈ 0.0000527
* so the probability you get depends on which N-gram order is used (a sketch of this
computation follows after this list)
o Perplexity is the inverse probability of the test set, normalized by the number of
words: PP(W) = P(w1 … wN)^(-1/N)
* Some probabilities can be very small -> use log-space (sum of the log probabilities =
log of the sentence probability); see the sketch below
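A minimal sketch of the bigram sentence probability from the example above, the log-space version, and the perplexity formula; only the conditional probabilities given in the example are used.

```python
import math

# Bigram probabilities from the example above.
bigram_p = {("I", "do"): 0.4, ("do", "not"): 1.0,
            ("not", "like"): 1.0, ("like", "them"): 0.5}

sentence = ["I", "do", "not", "like", "them"]

p = 1.0
log_p = 0.0
for prev, cur in zip(sentence, sentence[1:]):
    p *= bigram_p[(prev, cur)]
    log_p += math.log(bigram_p[(prev, cur)])   # log-space: sum instead of product, avoids underflow

print(p)                 # 0.2
print(math.exp(log_p))   # 0.2 again, recovered from log-space

# Perplexity: inverse probability normalized by the number of words.
n = len(sentence)
print(p ** (-1 / n))
```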
English tagged corpora exist: corpora that have been tagged with parts-of-speech and checked
manually; they can be:
Balanced: containing carefully selected sections of text from different domains
Parsed: also offering info about the syntactic relations between words
POS tagging problem: words can have more than one POS (one word can get different tags,
depending on the context) -> what tag should it be?
The annotation guide (book with tagging guidelines) helps decide. For example, adjective vs noun:
substitute a more common adjective or noun -> the one that still makes sense gives the correct class
Most words are very easy, but some words are hard (e.g. 'still' can belong to 7 classes); however, some
tags are a lot more likely than others (-> probability)
With a very (much too) simple tagging algorithm, the accuracy is already 90% -> the baseline is
already pretty high, so a real tagger really needs to aim for close to 100% (see the baseline sketch below)
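One common "much too simple" baseline (my assumption of what is meant here) is to tag every word with its most frequent tag from a tagged corpus; a sketch with a made-up toy corpus:

```python
from collections import Counter, defaultdict

# Tiny toy tagged corpus of (word, tag) pairs -- illustration only.
tagged = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
          ("the", "DET"), ("run", "NOUN"), ("was", "VERB"),
          ("fun", "ADJ"), ("dogs", "NOUN"), ("run", "VERB")]

counts = defaultdict(Counter)
for word, tag in tagged:
    counts[word][tag] += 1

def most_frequent_tag(word):
    # Unknown words get a default tag; known words get their most common tag.
    if word not in counts:
        return "NOUN"
    return counts[word].most_common(1)[0][0]

print([most_frequent_tag(w) for w in ["the", "dogs", "run"]])
# 'run' occurs as both NOUN and VERB; the baseline just picks one,
# which is exactly where a context-aware tagger is needed.
```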
Part-of-speech tagging is useful for, e.g.:
Speech recognition: POS of words tell what words are likely to occur around it
Speech synthesis: tag tells something about pronunciation
Morphological processing: tells what affixes it can take
Parsing: for keeping lexicon out of parse rules (we want to keep rules as small as possible)
Corpus linguistics: useful for finding instances of particular syntactic constructions (i.e. when
trying to find certain words with regex)
The F-measure (F_beta) combines precision and recall: if beta is higher (e.g. 2) -> emphasizes recall, if lower (e.g. 0.5) -> emphasizes precision. (Usage
examples see slide 9)
* For multiple classes -> calculate the weighted average (see example on slide 11); a small sketch follows below
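The standard formula behind this trade-off is F_beta = (1 + beta^2) · P · R / (beta^2 · P + R); a small sketch with made-up precision and recall values:

```python
def f_beta(precision, recall, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.9, 0.5           # hypothetical precision and recall
print(f_beta(p, r, 1))    # F1: balanced
print(f_beta(p, r, 2))    # beta = 2: pulled towards the (lower) recall
print(f_beta(p, r, 0.5))  # beta = 0.5: pulled towards the (higher) precision
```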
Dependency grammar: some words depend on other words, a root is not dominated by another
word. Represented with arrows pointing towards the superior from the (dependent) subordinate,
where the arrows have labels
Lecture 6:
Ambiguity: several parses of the same sentence can be grammatically correct, but the sentence is then interpreted differently ->
need preferences for choosing an interpretation
Transition-based dependency parsing: initial configuration -> classifier (which predicts the
parser action needed to move to next configuration) -> until terminal configuration (=root
node of tree is only one left in stack)
o Parser configuration consists of:
Buffer: all words that still need to be processed
Stack: all words that are currently being processed
Partial dependency tree: the dependency arcs created so far (some words have already been
attached in the tree)
o Parser actions:
Shift transition (SHIFT): removes front word from buffer-> adds to top of
stack
Left-arc transition (LARC): creates a dependency arc from the top-most word of the stack to the second
top-most word (which becomes its dependent) and removes the second top-most word
Right-arc transition (RARC): creates a dependency arc from the second top-most word of the stack to the
top-most word (which becomes its dependent) and removes the top-most word
Train the dependency parser's classifier (e.g. a neural network) to learn these preferences; a minimal sketch of the transitions follows below.
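A minimal sketch of the three transitions (SHIFT, LARC, RARC) operating on a stack and buffer; the action sequence is chosen by hand here instead of by a trained classifier, and the example sentence is my own.

```python
# Arc-standard style transitions; a trained classifier would normally choose the action.
def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))            # front of buffer -> top of stack

def left_arc(stack, buffer, arcs):
    dep = stack.pop(-2)                    # second top-most becomes a dependent
    arcs.append((stack[-1], dep))          # head = top-most word

def right_arc(stack, buffer, arcs):
    dep = stack.pop(-1)                    # top-most becomes a dependent
    arcs.append((stack[-1], dep))          # head = second top-most word

stack, buffer, arcs = ["ROOT"], ["she", "reads", "books"], []
for action in [shift, shift, left_arc, shift, right_arc, right_arc]:
    action(stack, buffer, arcs)

print(arcs)    # [('reads', 'she'), ('reads', 'books'), ('ROOT', 'reads')]
print(stack)   # ['ROOT'] -- terminal configuration: only the root is left on the stack
```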
Measures for evaluating dependency parsers:
o Exact match score (EM): # of sentences with a completely correct dependency tree / total # of sentences
o Unlabelled attachment score (UAS): # of words with the correct head / total # of words
o Labelled attachment score (LAS): # of words with the correct head and the correct label / total # of words
(both arcs and labels must be correct)
WordNet stores words with their meaning and their semantic relations to other words (organized
hierarchically)
has synonyms and hypernyms (= a category that includes other words, e.g. color)
Misses new meanings of words (needs to be constantly updated) and word similarity
cannot easily be computed ->
o Could represent words as discrete symbols (each word gets a unique encoding
(= localist representation)), like binary codes in a space; words are then represented
as one-hot vectors (all 0's and a single 1) -> every word is maximally dissimilar to every other word
o Use WordNet's hypernym hierarchy to obtain a better measure of semantic similarity
(but this doesn't give intuitive results, and such discrete representations need too many 0's)
o -> encode word similarity directly into real-valued vectors -> use cosine vector
similarity (good because it looks at direction (the angle ϴ), not magnitude); see the sketch below
o Move from probability to the average negative log likelihood as the objective function to minimize
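A small numpy sketch contrasting one-hot vectors (every pair maximally dissimilar) with dense real-valued vectors compared by cosine similarity; the dense vectors are made-up numbers, not trained embeddings.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: compares direction (the angle), ignores magnitude.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot (localist) vectors: every pair of different words has similarity 0.
cat, dog = np.array([1, 0, 0]), np.array([0, 1, 0])
print(cosine(cat, dog))          # 0.0 -> maximally dissimilar

# Dense real-valued vectors (made-up values): similarity becomes graded.
cat_d = np.array([0.8, 0.1, 0.3])
dog_d = np.array([0.7, 0.2, 0.35])
car_d = np.array([-0.5, 0.9, 0.0])
print(cosine(cat_d, dog_d))      # close to 1: similar direction
print(cosine(cat_d, car_d))      # much lower: different direction
```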
Gradient descent: iterative algorithm for finding local min (of a differentiable function)
o Multivariate gradient at a point (so for multiple parameters): the gradient is itself a
vector (consisting of the partial derivatives)
o Find the partial derivative of the loss with respect to each parameter (so how much every parameter influences the output
error) to determine the update (slides 26-29) (not very important(?)); a minimal sketch follows below
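A minimal sketch of gradient descent with two parameters: the gradient is the vector of partial derivatives, and each iteration moves the parameters against it. The loss function and learning rate are my own toy choices.

```python
# Toy loss with two parameters: L(a, b) = (a - 3)^2 + (b + 1)^2
def grad(a, b):
    # Vector of partial derivatives (dL/da, dL/db).
    return 2 * (a - 3), 2 * (b + 1)

a, b = 0.0, 0.0          # initial parameter values
lr = 0.1                 # learning rate (step size)
for step in range(100):
    da, db = grad(a, b)
    a -= lr * da         # move each parameter against its partial derivative
    b -= lr * db

print(round(a, 3), round(b, 3))   # approaches the minimum at a = 3, b = -1
```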