
1. Corpus Linguistics
What's a corpus?
A finite collection of texts, stored in electronic form, collected in a systematic and controlled way,
homogeneous and representative (both qualitatively and quantitatively) wrt a certain linguistic domain.
Corpora can be classified according to different parameters:
 Genericity
 Modality
 Time
 Language
 Coding
 Extension
 Representativity
 Closed/Monitored
 Integrity (full/partial texts)
Before using a corpus, text normalization must be carried out → convert the text into a more convenient form,
with word expansion (splitting a string into words), tokenization (e.g. Byte-Pair Encoding) and lemmatization
(determining roots) → when correcting a form, the best alternative is the one requiring the fewest
substitutions/deletions/insertions (see the sketch below).
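A minimal sketch of this minimal-distance idea (dynamic programming over insertions, deletions and substitutions); the function name and the casa/case example are only illustrative:

```python
def edit_distance(source: str, target: str) -> int:
    """Minimal number of insertions, deletions and substitutions turning source into target."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                       # i deletions
    for j in range(m + 1):
        dist[0][j] = j                       # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + sub)    # substitution
    return dist[n][m]

# the alternative with the lowest distance is the best normalization candidate
print(edit_distance("casa", "case"))   # 1: a single substitution
```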
In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or
validating linguistic rules within a specific language territory.
Examples: CHILDES (1985): archive of transcriptions of spontaneous child-directed speech, with CHAT
coding; Brown (1964): 1 million tokens, representative of written English, 15 categories, unannotated; Penn
Treebank (1986): 1 million tokens, fully syntactically annotated with the standard TREEBANK II tagging style

How big (and balanced) should it be to be representative of a specific linguistic reality?  


The general rule is that the more data you have, the more representative the corpus is. However, a larger
amount of data may also introduce noise. Depending on the phenomenon you are interested in, you have to
take into consideration the relevant parameters and your specific needs:
 Genericity
 Modality
 Time
 Language
 Coding
 Extension
 Representativity
 Closed/Monitored
 Integrity (full/partial texts)

What kind of annotation is necessary to add linguistic value?


We can have these levels of annotation:
 UNSTRUCTURED: only text
 STRUCTURED: specific/precise linguistic data annotation
 SEMI-STRUCTURED: corpora with conventional annotation conveying extra-linguistic information
(HTML pages, formatted text)
In order to make the corpora more useful for doing linguistic research, they are often annotated. This can
be done at different linguistic levels: POS-tagging (information about each word's part of speech),
lemmatization (base form of each word), full grammatical parsing (Treebanks) or Parsed Corpora, semantic
(synsets in Wordnet) and pragmatic parsing.

The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these
corpora are usually smaller, containing around one to three million words. But the depth of annotation
clearly depends on the intended use of the corpus.

How do we query corpora according to their level of annotation?


Unannotated corpora can be used to extract different types of information (see the sketch after this list):
 Regular Expressions, able to identify patterns: [Ss]he matches she or She
 KWIC → format for concordance lines with a key and its context
 FREQUENCY OF THE LEXICON
o Also used to calculate the TYPE/TOKEN RATIO → richness of vocabulary
 N-GRAMS and Language Models based on them
(Semi-)Annotated corpora
 Grammar extraction
 Benchmark for POS Tagging and parsing tools
 Linguistic studies of the type “are children sensitive to the finiteness of the verb in FRA and ITA?”
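A minimal sketch of these ways of querying a raw corpus (a regular expression, KWIC-style concordance lines, and the type/token ratio); the toy corpus and the naive whitespace tokenization are only illustrative:

```python
import re

corpus = "She saw the dog . The dog she saw chased the cat ."
tokens = corpus.lower().split()          # naive tokenization, for illustration only

# 1) regular expression: [Ss]he matches both "she" and "She"
matches = re.findall(r"\b[Ss]he\b", corpus)

# 2) KWIC: keyword in context, with a window of 2 tokens on each side
def kwic(tokens, key, window=2):
    lines = []
    for i, tok in enumerate(tokens):
        if tok == key:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>15} [{key}] {right}")
    return lines

# 3) type/token ratio as a rough measure of lexical richness
ttr = len(set(tokens)) / len(tokens)

print(matches)
print("\n".join(kwic(tokens, "dog")))
print(f"type/token ratio: {ttr:.2f}")
```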
Why are corpora useful for linguistic, psycho-linguistic and computational-linguistic studies?
Corpora can be used as a source of ecological data (e.g. for balancing a psycholinguistic experiment), for the
creation of dictionaries and grammars, for building linguistic models based on frequencies and distributions,
and as linguistic benchmarks for NLP tools.
Based on frequency and distribution in a corpus, we can create Language Models based on the notion of n-
gram (a sequence of n tokens), which use statistical and probabilistic techniques to determine the
probability of a given sequence of words occurring in a sentence.

 Next-word probability: the probability of the next word given its whole history is approximated by its
probability given only the previous n−1 words (see the sketch below)
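A minimal sketch of a bigram (n = 2) language model estimated from counts, applying the chain rule under the n-gram approximation; the toy corpus is illustrative and no smoothing is used:

```python
from collections import Counter

tokens = "<s> the dog barks </s> <s> the cat sleeps </s>".split()

bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def p_next(word, prev):
    """Maximum-likelihood bigram estimate P(word | prev) = C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    """Chain rule with the bigram approximation: P(w1..wn) ≈ Π P(wk | wk-1)."""
    prob = 1.0
    for prev, w in zip(words, words[1:]):
        prob *= p_next(w, prev)
    return prob

print(p_next("dog", "the"))                           # 0.5: "the" is followed by "dog" or "cat"
print(p_sentence("<s> the cat sleeps </s>".split()))  # 0.5
```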

2. Formal Grammars (morphology and syntax)


We need a theory that describes our linguistic competence = finite knowledge allowing us to
1. Recognise grammatical input
2. Assign correct meaning
Chomsky distinguishes between this and performance, the way a language is used in communication.
According to him, competence is the system that enables speakers to produce and understand an infinite
number of sentences and to distinguish grammatical sentences from ungrammatical ones.
What's a formal grammar? 
A formal grammar for a language L is a set of rules that allows us to recognise and generate all and only
the sentences that belong to L and assign to them an adequate structural description.
It must be EXPLICIT (any judgement should be the result of mechanical application of the rules) and
CONSISTENT (for the same input, the procedure will always give the same result).
A grammar must provide an adequate description of the linguistic reality we want to describe
In order to formalize a grammar we need:
A → alphabet (finite)
V → vocabulary (potentially infinite), built by concatenating elements of A; V is a subset of A*
L → potentially infinite set of sentences, built by concatenating elements of V; L is a subset of V*

Eg. PHRASE STRUCTURE GRAMMAR (Chomsky, 1965) ordered 4-tuple (VT, VN, →, {S}):
1. Vt =Terminal vocabulary
2. Vn = Non terminal vocabulary Vt U Vn = V
3. {S} = subset of Vn, ROOT NODE
4. → = rewriting rule encoding both precedence and dominance relations; binary, asymmetric,
transitive
Defined on V*: ∀A∈Vn, φAψ → φτψ for some φ,τ,ψ∈V*

DERIVATION: given two strings φ,ψ∈V*, there is a φ-derivation of ψ if φ →* ψ; if such a derivation exists, φ
DOMINATES ψ (a reflexive & transitive relation); the derivation is terminated if ψ∈Vt*
Given a grammar G, the language generated by it is L(G), the set of all strings φ for which a terminated
S-derivation of φ exists
Eg. Structural description (syntactic tree) ordered 5-tuple (V, I,D,P,A):
V= finite set of vertices
I= finite set of labels
D= dominance relation (weak) defined on V
P= precedence relation (strict order) defined on V
A= assignment function (from V to I)

(--------- GENERATIVE CAPACITY: set of sentences that can be generated


EQUIVALENCE of two grammars: weak (same strings) and strong (same structural descriptions)
DECIDABILITY: a set A is decidable if there exists an algorithm that, for any element of the
universal set, determines whether that element belongs to A in a finite number of steps

Adequacy of a grammar, 3 levels (Chomsky, 1965)


 Observational: the language described by the grammar coincides with the one we want to describe
 Descriptive: the grammatical analysis provides relevant structural descriptions, coherent with
speaker’s intuitions
 Explicative: the grammar is learnable and allows us to draw conclusions on what is more difficult to
process)

What does this rule mean? S -> NP VP 


This is a rewriting rule: the non-terminal node on the left must be rewritten as the two non-terminal nodes
on the right, so it can only produce a sentence that contains those two constituents. It also provides a
structural description: the rule encodes a dominance relation (S dominates NP and VP) and a precedence
relation (NP precedes VP), as in the sketch below.
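A minimal sketch of such rewriting rules as a Python dictionary, with a top-down expansion of S into a terminal string (a random derivation); the toy grammar and lexicon are assumptions:

```python
import random

# toy phrase-structure grammar: non-terminal -> list of possible right-hand sides
rules = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["chases"], ["sleeps"]],
}

def expand(symbol):
    """Rewrite `symbol` top-down until only terminal symbols are left."""
    if symbol not in rules:              # terminal: nothing to rewrite
        return [symbol]
    rhs = random.choice(rules[symbol])   # pick one rewriting of the non-terminal
    result = []
    for child in rhs:                    # S dominates NP and VP; NP precedes VP
        result.extend(expand(child))
    return result

print(" ".join(expand("S")))   # e.g. "the dog chases the cat"
```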

(--------------------REGULAR EXPRESSIONS: algebraic notations to express sets of strings, used to identify a
pattern in a text; they are composed of alphanumeric strings and special signs indicating relations among them

REGULAR GRAMMARS: 4-tuples similar to PSG (Vt, Vn, →, {S}) with some extra constraints:
 all production rules have at most one non-terminal symbol on the right side of the rule
 the position of that non-terminal symbol on the right side is consistent (because the elements there
are ordered in a precedence relation (vs strict minimalism, only hierarchy)), yielding left-recursive
or right-recursive grammars

Automata: mathematical computation models composed of states and transitions among states;
conceivable in the form of graphs. A FINITE STATE AUTOMATON (FSA) is a machine composed of a set of states &
the possible transitions from one to the other; a 5-tuple <Q, Σ, Q0, F, δ> such that
 Q → finite set of states
 Σ → characters acceptable as input
 Q0 → initial state(s), with Q0∈Q
 F → final states
 δ → transition function (which state is reached from which, given an input character)
For FS grammars, the language is a graph that connects words through transitions, w/o memory of the
previous stages. They can be used to recognise strings/words or to represent a sentence/language
RE = FSA = RG → they all describe REGULAR LANGUAGES))

Are you able to define recursion? How do you implement it in Regular Expressions (RE), Finite State
Automata (FSA), Regular Grammars (RG)?
Recursion is the basic property of NL that allows us to make infinite use of finite means, or, in other words,
the ability to place one component inside another component of the same kind.
A RECURSIVE RULE is a rule in which the output of first application can be fed as input of a second
application of the rule. The same recursive mechanism is implemented
in RG: the same non-terminal element on the left and on the right of the rewriting rule:
S → aR, R → bR, R → Ø
in RE: Kleene closure → all the possible concatenations of an element, including the null one: ab*
in FSA: inserting a loop on a node (Q0 —a→ Q1, with a b-loop on Q1), as in the sketch below
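A minimal sketch of the corresponding FSA for ab*: the loop on the second state (here called q1; state names are illustrative) implements the recursive rule R → bR:

```python
# FSA for the language ab* : states Q = {q0, q1}, start q0, final states {q1}
# the b-loop on q1 is the FSA counterpart of the recursive rule R -> bR
transitions = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",   # loop: recursion realized as iteration
}
start, finals = "q0", {"q1"}

def accepts(string):
    state = start
    for ch in string:
        if (state, ch) not in transitions:
            return False
        state = transitions[(state, ch)]
    return state in finals

print([w for w in ["a", "ab", "abbb", "ba", "abab"] if accepts(w)])  # ['a', 'ab', 'abbb']
```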

(------------------- CONTEXT FREE GRAMMARS (CFG) are RGs without restrictions on the right side. They only
admit rules of the type A → γ, where γ is any sequence of terminal/non-terminal symbols)
They can express/describe SYNTACTIC AMBIGUITY (e.g. S → aSb, S → Ø)

A PUSH DOWN AUTOMATON (PDA) is an FSA endowed with a
memory stack Γ (only the last element is accessible).
It can parse MIRROR RECURSION (XX^R, a string followed by its reversal), a type of CENTRE-
EMBEDDING (see the sketch below).

CFG AND PDA both describe CONTEXT-FREE LANGUAGES)
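A minimal sketch of a stack-based recognizer for the counting-recursion language a^n b^n (generated by S → aSb, S → Ø), showing how the PDA's last-in-first-out memory Γ goes beyond an FSA; a recognizer for mirror recursion XX^R works similarly, pushing the first half and popping while matching the second (nondeterministically guessing the midpoint):

```python
def accepts_anbn(s):
    """Recognize a^n b^n (n >= 0) with an explicit stack: the PDA's memory Γ."""
    stack = []
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:               # an 'a' after a 'b' is not allowed
                return False
            stack.append("A")        # push one symbol for every 'a'
        elif ch == "b":
            seen_b = True
            if not stack:            # more b's than a's
                return False
            stack.pop()              # pop one symbol for every 'b'
        else:
            return False
    return not stack                 # accept iff the counts match

print([w for w in ["", "ab", "aabb", "aab", "abab"] if accepts_anbn(w)])  # ['', 'ab', 'aabb']
```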

What do Pumping Lemmas tell us?


Based on the Pigeonhole Principle: if p pigeons are placed into fewer than p holes, at least one hole
must contain more than one pigeon; pumping lemmas are used to decide whether a language can be captured
by a certain class of grammars.

Describes a property of Regular Languages: every sufficiently long string in the language has a
section within its first p symbols that can be repeated (pumped) any number of times with the result
still in the language. The pumping part is the recursive part (right recursion: the dog bit the cat that
chased the mouse that ran).
It can be used to show that a language is not regular (not described by FSAs):
a^n b^n (CENTRE-EMBEDDING / COUNTING RECURSION) is not generated by RG/FSA (finite amount of
memory) but is present in NL (e.g. IF…s…THEN…s… structures)

Describes a property of Context-Free Languages: for every sufficiently long string of a CFLanguage L, it is
possible to find two substrings that can be ‘pumped’ (the same number of times) with the result still in
the same language.
It can be used to show that a language is not context free:
XX (CROSS-SERIAL DEPENDENCY/IDENTITY RECURSION) → not generated
by CFG but present in NL (e.g. X, Y and Z are respectively X, Y, Z)

Why are Context-Free Grammars "more powerful" than Regular Grammars?


RG: strongest possible constraints that will generate the simplest possible language are A→ xB (one non
terminal on one side and one on the other) + consistent position: either right or left-recursive grammar
Dominance and precedence are strictly correlated

CFG: an RG without the restriction on the right side. It allows binary branching and deep embeddings.
It is more powerful because it can capture counting recursion and it can describe in a more straightforward
way syntactic ambiguity (rules with the same left-side symbol must be present to allow for ambiguity; an
exponential problem: the more PPs, the more possibilities), which requires freedom in nesting →
precedence is no longer strictly correlated with dominance
Discuss Chomsky's Hierarchy.
Restrictions on the rewriting rules create formal grammar classes organized in an inclusion hierarchy:
Type 0 → Turing-equivalent grammar → implemented by a Turing Machine → recursively
enumerable languages → no constraints on the rules
Type 1 → Context-Sensitive Grammar → linear-bounded automaton → production rules
may be surrounded by a context of terminal and non-terminal symbols
Type 2 → Context-Free Grammar → PDA → captures counting dependencies → rules of the form A → β
(β = any sequence of (non)terminal symbols)
Type 3 → Regular Grammar → FSA → at most one non-terminal symbol on the right side of the rule,
in a consistent position
Natural Languages are considered by Chomsky to be Mildly Context-Sensitive Languages

Are Context-Free Grammars sufficient to express any Natural Language property?


Recursion, as a basic property of NL, is not entirely captured by either RG or CFG. In particular:
 RIGHT RECURSION (AB^n) is captured by both: [the dog bit [the cat [that chased [the mouse]]]]
 CENTRE-EMBEDDING/COUNTING RECURSION (A^nB^n) is not captured by RG, only by CFGs: [the
mouse [that the cat [that the dog bit] chased] ran] → also in IF…THEN… structures
 CROSS-SERIAL DEPENDENCY/IDENTITY RECURSION (XX) is captured by neither: e.g. X, Y and Z are
respectively X, Y, Z

3. PARSING AND COMPLEXITY


(-------------INPUT NORMALIZATION: different levels have different strategies, but the main idea is to match
patterns against lexical items stored in memory + some heuristics for substitution
 Symbolic approaches (explicit heuristics)
o MINIMAL DISTANCE (Wagner, 1974): the best alternative minimizes the number of
insertions, deletions, substitutions and switchings of characters → compare the form with any
possible transformation of this form and check for alternatives in the lexicon
LIMITS: huge number of calculations
o SIMILARITY KEY – SOUNDEX algorithm (1918, 1962): extract a key from the wrong form and
from other lexical items; items sharing the same key are good candidates
o Improved similarity key – SPEEDCOP algorithm (1984): 2 keys
 Sub-symbolic approaches (machine learning)
o N-GRAMS (1980) idea: a word is a set of overlapping n-grams; a vocabulary is an indexed n-
gram table – every index is a word in the lexicon
 T9 – needed resources: dictionary, frequencies + Fitts’ Law
 BYTE-PAIR ENCODING (2016) – idea: guess the tokens without rule-based tokenization; start w/
a vocabulary consisting of each character + count the most frequent co-occurrences,
merge them & add them to the lexicon, substitute items with their merge in the corpus
+ repeat k times (see the sketch after this list)
 Language Models assign probabilities to sequences of words/n-grams;
probability of having a word given its history = P(w|h);
probability of a sentence calculated by using the chain rule of probability:
P(w1,w2,…,wn) = P(w1)·P(w2|w1)·P(w3|w1,w2)·…·P(wn|w1,…,wn−1) → n-gram approximation:
P(wn|w1,…,wn−1) ≈ P(wn|wn−N+1,…,wn−1)
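A minimal sketch of the BPE loop described above (start from characters, repeatedly merge the most frequent adjacent pair, add the merged symbol to the vocabulary, repeat k times); the word list and k are toy assumptions:

```python
from collections import Counter

def bpe(corpus_words, k=4):
    """Toy Byte-Pair Encoding: learn k merges from a list of words."""
    words = [list(w) for w in corpus_words]          # each word starts as a character sequence
    vocab = {ch for w in words for ch in w}
    for _ in range(k):
        pairs = Counter()                            # count adjacent symbol pairs in the corpus
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]          # most frequent co-occurrence
        vocab.add(a + b)                             # add the merged symbol to the vocabulary
        for w in words:                              # substitute the pair with its merge
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                i += 1
    return vocab, words

vocab, segmented = bpe(["lower", "lowest", "newer", "wider"], k=4)
print(segmented)   # each word segmented into learned subword units (merges depend on the toy corpus)
```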
The computational lexicon (= efficiency of the lexicon's representation) can be evaluated based on
1. COVERAGE (domain, featural info)
2. EXTENSIBILITY (how easy it is to enrich it)
3. UTILITY → how useful it is for the analysis

Single entry structure (ortho-phonetic, morphological, syntactic and semantic info, e.g. XML, DTD, TSV) vs
Global lexicon structure (idea: subcategorization might be related to semantic class → infer the semantic
class on the basis of the hierarchical organization of items in an ontology; e.g. WORDNET, a semantic network
organized based on meaning, where each lexical concept is a synset, represented by its synonyms →
polysemy solved by creating 2 synsets; not language-specific)
Evidence for a structured lexicon from psycholinguistics: priming effects, pronunciation errors

Goal of morphology: recognize a well-formed string & decompose it into morphemes. Morphological
analysis can be applied for
1. INFORMATION EXTRACTION → roots, not tokens
2. KEYWORDS EXPANSION → strip inflection
3. STEMMING → retrieve the word root (stem); e.g. the PORTER STEMMING ALGORITHM, a set of cascaded FSTs

What's the Two-Levels morphology model?


Proposed by Koskenniemi (1983): in order to integrate morphology and phonology, we should distinguish between a
 LEXICAL FORM (#c a s a + e#)
 SURFACE FORM (#c a s e#);
these are integrated in the lexicon through a series of transducers, special types of finite-state automata,
the FINITE-STATE TRANSDUCERS (FST).
They allow us to recognise an element as belonging to the lexicon and to associate a structural description
with it. This is different from a simple morphological recognizer/FSA, which only performs a
string-to-string match (→ it cannot describe whether the form is singular or plural, so it cannot associate a
structural description either: it will only say yes/no) → whereas FSAs define a formal language (a set of strings),
FSTs define relations among languages.

An FST is a 5-tuple <Q, Σ, Q0, F, δ>:

 Q = finite set of states
 Σ = finite set of symbols of the i:o form (Σ ⊆ I×O, pairing the input and output alphabets)
 Q0 = finite set of initial states
 F = finite set of final states
 δ = transition relation (q, q′, i:o), putting q into relation with q′ if i:o is defined (from Q×Σ to Q)
It basically works as 2 FSAs combined, where one recognises the surface string and the other rewrites it in
its lexical form (or the other way around, thanks to the property of inversion: a bidirectional process). All in all,
it is similar to an FSA, but the alphabet and the transitions are richer → symbols pair an input and an output
alphabet, and each transition puts an initial state into relation with a final state.
FSTs are able to recognise/generate/translate and to correlate sets; they can be used in parsing (see the
sketch below).
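A minimal sketch of the two-level idea: a transducer encoded as a transition table over lexical:surface symbol pairs ('0' = empty symbol), accepting the aligned pair #casa+e# : #case#; the state names and the single deletion rule are illustrative assumptions:

```python
# Two-level view: an FST accepts PAIRS of aligned lexical:surface symbols.
# The pair list below aligns the lexical form  casa+e  with the surface form
# case (Italian 'casa' -> plural 'case'): the stem-final 'a' and the morpheme
# boundary '+' surface as nothing, the plural 'e' surfaces as itself.

FINALS = {"q0"}
DELTA = {
    # identity pairs
    ("q0", ("c", "c")): "q0",
    ("q0", ("a", "a")): "q0",
    ("q0", ("s", "s")): "q0",
    ("q0", ("e", "e")): "q0",
    # stem-final 'a' may be deleted only if followed by the boundary '+' (also deleted)
    ("q0", ("a", "0")): "q1",
    ("q1", ("+", "0")): "q2",
    ("q2", ("e", "e")): "q0",
}

def accepts(pairs):
    """Return True if the aligned lexical:surface pairs are a valid correspondence."""
    state = "q0"
    for pair in pairs:
        key = (state, pair)
        if key not in DELTA:
            return False
        state = DELTA[key]
    return state in FINALS

casa_e = [("c", "c"), ("a", "a"), ("s", "s"), ("a", "0"), ("+", "0"), ("e", "e")]
print(accepts(casa_e))                                   # True: casa+e <-> case
print(accepts([("c", "c"), ("a", "0"), ("s", "s")]))     # False: deletion not licensed here
```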

(-----------PROBLEMS with FST:


 non-determinism (exponential growth: the more possible transitions, the more memory needed)
 inadequacy (with non-concatenative morphology, eg. semitic languages)
 order of application (e.g. Italian mattino ‘morning’ wrongly analysed as matt+ino, ‘little crazy one’))

What does it mean to "parse a sentence"?  


Informally, to divide a sentence into subcomponents and identify the parts and the relationships among
them. Formally, it means that, given a grammar G and an input i, we apply a function P(G, i) able to
1. Accept or reject i
2. Assign to i an adequate descriptive structure (syntactic tree)
So, it means determining whether a string x belongs to the language generable by G → the Universal
Recognition Problem (URP). The complexity of the URP for a linguistic theory is a direct measure of how
difficult it is to parse the languages generated by the class of grammars specified by the theory

(-----------The URP can be reduced to a SAT problem (the problem of determining whether there exists an
interpretation that satisfies a given Boolean formula), which is a non-deterministic (because of the problem
space) polynomial-time problem (NP). It is one of those problems for which it is difficult to find a solution,
but, given an oracle, the answer can easily be checked. Why is it an NP problem? Because of ambiguity!
Several strings in NL can receive an ambiguous value (lexical: flies as V vs N; semantic: Italian vecchia as
adjective vs noun; syntactic: PP attachment).)

How can Context-Free Grammars be used in parsing?


When proceeding through the search space created by the set of parse trees generated by a CFG, a parser
can work either top-down or bottom-up.
 When we proceed TOP-DOWN, we start from the root node S and work our way down to the
leaves; PRO: we don’t explore illegal parses (without S); CON: we waste time on trees that won’t match
the input
o Better with lexical ambiguity: every time an ambiguity is found, the representation is split in parallel
(drawbacks: costly, and it may produce analyses irrelevant to the input)
 When we proceed BOTTOM-UP, we start from the input and work our way up to the root node S; PRO:
we never explore trees inconsistent with the input; CON: we waste time exploring parses w/o S
o Better with syntactic ambiguity: it creates chunks of derivation and then tries to connect
them (drawback: possible ungrammatical partial processing)
Ideally, we would want a control strategy combining both

How is complexity expressed? Which dimensions are used for calculating the complexity of an algorithm?
The complexity of a problem is directly proportional to resource usage. In particular, given that a
computation is the relation between an input and an output, and that its completion is the reaching of a
final state, we have:
 TIME COMPLEXITY → number of steps required to reach the output
 SPACE COMPLEXITY → quantity of information to be stored at each step
As the complexity generally increases with the size of the input, the complexity order of the problem can
be expressed in terms of input length (as representing the mapping between input & output).
Since the input can grow without bound, as natural languages presumably do, the growth rate of the
complexity function is crucial to determine the tractability/computability of the problem, that is, whether a
procedure exists & terminates with an answer in a finite amount of time.

Computational complexity doesn’t seem to be strictly related to psycholinguistic complexity, i.e., the
difficulty of processing a sentence. Rather, it might be due to a limited processing capacity (eg. you pay a
cost whenever you have to store an element and storing elements that have similar features might
generate confusion).
 Limited-size stack (Yngve 1960): language processing uses a stack to store partial analyses; the
more partial processings stored, the harder the processing (eg. multiple embeddings)
 Syntactic Prediction Locality Theory (Gibson 1998): the total memory load is proportional to the
sum of required integrations + referentiality needs (a pronoun is less costly than a DP or a definite
description).

Does the size of the sentence matter?


Yes, in fact, complexity is proportional to both resource usage and problem dimension (= input length), as
we expect complexity to increase with the size of the input, since the computation is a relation between
input and output.
Is it better to go Top-Down or Bottom-Up? Is the answer to this question the same in any context?  
Grammars are declarative devices that don’t specify algorithmically how the input must be analysed.
Grammatical rules predict that from a root node, sentence expansion will lead to terminal nodes (words).
Words show how the sentence expansion must terminate. To parse a sentence, we can start both ways:
 TOP-DOWN (goal-driven) algorithm
explores all possible expansions of S offered by the grammar rules down to the leaves; it is better with
lexical ambiguity since every time there is ambiguity the representation is split (assuming parallel
processing) and one solution or the other is pursued. This is however costly & the algorithm may
spend time processing trees incompatible with the input (e.g. “la regola regola la regola”)
 BOTTOM-UP (data-driven) algorithm
starts from the lexical elements and then, phrase by phrase, goes up to the root node.
It is better with syntactic ambiguity, but it might create ungrammatical partial processings that won’t
lead to a root node
Both are complete strategies: they get to the solution eventually; they are roughly equivalent, but
TOP-DOWN is better for sentences that are less syntactically ambiguous and more lexically ambiguous,
BOTTOM-UP is better for sentences that are more syntactically ambiguous and less lexically ambiguous.

How does Left-Corner Algorithm operate?


It is a Top-Down algorithm filtered by Bottom-Up considerations.
It starts from the presupposition that every non-terminal (NT) will
be rewritten at some point by a word in the input & that the possible left
corners (LC) of all NT categories can be determined in advance and
placed in an OFFLINE TABLE (see the sketch below).
It consists in expanding the tree through the left side of the rule (e.g. A →* B α: B is a left corner of A).
It is a great improvement since it always expects a particular category (like TD algorithms)
and, contrary to a normal BU algorithm, where all elements of the right-hand side of the
rule must be recognized before applying the rule, here it suffices that the LC is recognized,
and then the other elements will be predicted TD. There are still some unsolved
problems:
 left-branching recursion: A →* A α (DP → DP PP)
 inefficiency in subtree analysis (with PP adjuncts) – exponential growth of
alternatives with the number of PPs
 coordination
These incomplete analyses are handled by dynamic programming, which reuses useful
analyses by storing them in tables (or charts). Once the sub-problems are solved (sub-trees in parsing), a
global solution is attempted by merging the partial solutions together
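A minimal sketch of how the offline left-corner table can be precomputed from a CFG (X is a left corner of A if A →* X …, i.e. the closure of the "first symbol of a rewriting" relation); the grammar is a toy assumption:

```python
# toy grammar: non-terminal -> list of right-hand sides
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["NP", "PP"]],
    "VP": [["V", "NP"]],
    "PP": [["P", "NP"]],
}

def left_corner_table(rules):
    """left_corners[A] = every category X such that A ->* X ... (transitive closure)."""
    lc = {a: {rhs[0] for rhs in rhss} for a, rhss in rules.items()}
    changed = True
    while changed:
        changed = False
        for a in lc:
            for x in list(lc[a]):
                for y in lc.get(x, ()):
                    if y not in lc[a]:
                        lc[a].add(y)
                        changed = True
    return lc

table = left_corner_table(RULES)
print(table["S"])    # {'NP', 'Det'}: a Det in the input can announce an NP, hence an S
```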

How does Earley parsing operate?


EARLEY PARSER is a top-down, parallel, complete,
dynamic programming approach. DYNAMIC
PROGRAMMING means to solve a problem by breaking
it down in smaller sub problems and store them in
tables/charts; then, once sub-problems are solved, a
global solution is attempted by merging all partial
solutions together. The search mechanism is TD, with a
left-to-right pass that fills a chart with N+1 entries (for an input of N words)

The charts include 3 levels of information:


 Subtree corresponding to each single grammatical rule (can be then shared by all parses)
 Progress in the completion of the rule: a dot indicates the processing step (dotted rule)
 Position of the tree w.r.t. input position

The EARLEY ALGORITHM is composed of 3 operations


 PREDICTOR
o Adds new rules in the input charts representing Top-Down expectations in the grammar;
for any state with an NT immediately to the right of the dot, a new state is created for each
alternative expansion of that rule provided by the grammar
 SCANNER
o examines input in the expected position, triggering an advancement when the word is
recognized as belonging to the expected PoS. A correct scan introduces a new rule in the
next position of the chart with the dot advanced over the predicted category
Problem: large lexicons
 COMPLETER
o Finds and advances all previous states that were looking for this category at this position
in the input; new states are created copying the old state, advancing the dot, installing the
new state in the current chart entry
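A compact sketch of an Earley recognizer implementing the three operations above, with dotted rules stored as (lhs, rhs, dot, origin) states in an N+1-entry chart; the toy grammar and lexicon are assumptions and empty productions are not handled:

```python
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "chases": "V"}

def earley_recognize(words):
    chart = [set() for _ in range(len(words) + 1)]   # N+1 chart entries
    chart[0].add(("GAMMA", ("S",), 0, 0))            # dummy start state
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):                       # something to the right of the dot
                nxt = rhs[dot]
                if nxt in GRAMMAR:                   # PREDICTOR: add top-down expectations
                    for expansion in GRAMMAR[nxt]:
                        new = (nxt, tuple(expansion), 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); agenda.append(new)
                elif i < len(words) and LEXICON.get(words[i]) == nxt:
                    # SCANNER: the word has the expected PoS, advance the dot
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                    # COMPLETER: rule fully recognized
                for (l2, r2, d2, o2) in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new); agenda.append(new)
    return ("GAMMA", ("S",), 1, 0) in chart[len(words)]

print(earley_recognize("the dog chases the cat".split()))   # True
```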

What's a Minimalist Grammar?


In linguistics, the minimalist program is a line of inquiry that has been developing inside generative
grammar since the early 1990s, starting with a 1993 paper by Noam Chomsky. It aims at capturing linguistic
universals to explain the limited syntactic variability across languages and providing the theory with
explanatory adequacy (learnability and evolvability), through the means of a minimal and unique structure
building operation: MERGE.
The MINIMALIST GRAMMAR, formalized by Stabler (1997), is a 4-tuple {V, Cat, Lex, F} such that
 V → a finite set of non-syntactic features (P ∪ I, where P = phonetic, I = semantic features)
 Cat → finite set of syntactic features = { base ∪ select ∪ licensors ∪ licensees }
Base → standard categories/PoS {D, N, V…}
Select → specify selection requirements {=D, =N, =V…} → checked by merge
Licensors → satisfy licensee requirements {+wh, +case} → checked by
movement
Licensees → specify requirements that force phrasal movement {-wh, -case}
 Lex → finite set of expressions built from V & Cat
 F → set of two partial functions from tuples of expressions to expressions {MERGE, MOVE}
→NEW: the building operations are introduced in the grammar
MERGE → feature driven operation: the selecting feature is the one that projects and
selects a compatible feature element (VS Chomsky’s assumption, according to which merge
is free, not a feature-checking operation, so that rejecting a structure happens later in the
process)
Problems:
 successive cyclicity: movement is not long distance; rather, it goes step by step. But how many
features do we have to assume at the beginning, since it’s a bottom-up derivation?
 Islandhood – grammar is blind to left branching; only the right branch can be inspected

How do you estimate complexity in terms of intervention/locality?


(-----------We have seen that CFG rules are not cognitively plausible and are inefficient for accounting for the
overall language system: they are i) language-specific, ii) too numerous, iii) not learnable, and iv) they allow
no generalization. The alternatives are then:
 P&P (Fong 1991): linguistic universals capturing limited syntactic variability across languages +
parameter setting  the interaction among them produces many alternatives.
We have explicative adequacy (learnability & evolvability), but principles are computationally
inefficient.
From parsing perspective, the parser should work as a PDA, but huge ordering problem (16!
combinations), vacuous computation that needs to be filtered out.
 Minimalist grammars: try to simplify P&P approach assuming one innate component (MERGE) 
grammar is a dynamic entity, not just a list of rules.
o commit to a Bottom-Up approach, computationally and theoretically good, but psycho-
linguistically implausible since merge & move operate in opposite direction wrt the parser
 Top-down derivations: cognitively plausible, Merge & Move can be implemented, and locality can
be captured too → Phase-Based Minimalist grammar, in which we can calculate difficulty and
complexity word-by-word in a way that is processing friendly. Here we reconcile the formal
account of intervention (what counts as intervener) with processing evidence (when and how,
essentially, it is a retrieval)
A PHASE is the minimal computational domain in which a selection requirement must be satisfied; given a
lexical item [=Y X], [Y…] is the selected phase → merge reduces to lexical selection. If we assume that
selection includes both functional and lexical features at the same time, a Phase is a subtree to expand.
MOVEMENT in this account is triggered by only partial expectations, which cannot be fully interpreted in
their position – the element is then stored in memory buffer (first-in last-out) and retrieved when needed;
phases restrict domain in which non-local dependencies must be satisfied)

In the INTERVENTION ACCOUNTS OF COMPLEXITY (Rizzi 1990), the processing difficulty is said to be
proportional to the number and kind of relevant features
shared between the moved item and any possible
intervener. An example of long-distance dependency is that
of Object Clefts. In relation to this, Warren & Gibson (2005)
compare three types of DPs in Object Clefts (definite
description VS proper name VS pronoun), and the results of
the reading times show that not only the type of intervener
matters, but also their position → an integration cost alone is not
enough (in Linguistic Integration Cost (Gibson), difficulty is
proportional to the distance in terms of number of
intervening discourse referents, following a referentiality
hierarchy). At the same time, intervention-based accounts
are not gradable: standard bottom-up theories can only
predict what creates complexity, but nothing clear on the
processing (eg. why is slowdown observed at the verb
segment?).

We can also derive Object Clefts top-down (PMG): in CUE-BASED
RETRIEVAL AND INTERVENTION, intervention is
expressed as interference at retrieval; the locus of
expressed as interference at retrieval; the locus of
interference, the major constraint to access information in
memory, is at retrieval (Van Dyke & McElree 2006), with
little effect on memory encoding or storage.
We can therefore calculate the Feature Retrieval Cost,
basically stating how tough it is to retrieve something from
memory given that you have multiple elements in there. It
is proportional to the number of elements in memory and
to the number of features to integrate; it is reduced by
number of features cued by the element that attempts
retrieval.
We can also calculate the feature encoding cost, the
numerical value associated to each new item merged,
proportional to the number of new relevant features integrated in the structure.
By using Phase-Based Minimalist Grammar Parsing (a left-to-right single-path algorithm) we can make
predictions about the difference in complexity between OR and SR clauses: considering the embedded
predicate region, at the same point where in SR we discharge an element from the memory buffer, in OR
we integrate an element into the memory buffer and discharge it later on. So, the number of
elements in the memory buffer matters.
 Depending on the features that we store in the memory buffer and the feature that the predicate
attempts to retrieve, we can predict what can be retrieved more easily or harder in a gradual way.
By rephrasing the INTERVENTION-BASED idea in Top-Down terms, we test PMG and CUE-BASED
RETRIEVAL and INTERVENTION (→ the matching condition is problematic, except with pronouns).

4. Language in the brain and Machine Learning


(----------SYMBOLIC REPRESENTATION (what we’ve seen so far): representational system in which atomic
constituents of representations are themselves representations; characterized by a compositional syntax &
semantics
 Symbolic processors take a symbolic input and give back a symbolic output by means of
hierarchically ordered logical rules, and can create a symbolic network
 Computer metaphor: the software is crucial (algorithmic/specification level), whereas the hardware
mapping (implementation level) is fixed and automatic → neurons implement all cognitive
processes in the same way by supporting the basic operations that are required for symbol
processing
o Assumptions: symbols are available to the cognitive system, and they constitute the
processing engine which is systematic (combinatorial symbolic system & compositionality)
and made of recursive knowledge structures
o Syntax: a set of rules; semantics: obtained by combining constituents in accordance with the rules
 Learning theories of symbolic architectures are limited, so they turn to evolutionary explanations
(eg. Chomsky’s UG)

SUB-SYMBOLIC (IMPLICIT) REPRESENTATION: its constituent entities are not representations themselves;
succeeds in having an idea of what’s a constituent without having a rewriting rule; only input and output
are coded as discrete/symbolic entities; no explicit notion of category
 Owes to neurobiology: it comes from the analysis of our brain: we can parcel our brain into
different functions (perceptive and performative), a dynamic functional system, and the system
complexity is an emergent property of simple interactions among parts
 Semantics is highly non-compositional, but maximally affected by context
 This idea was implemented in computer processing → PARALLEL
DISTRIBUTED PROCESSING: no explicit competence (what we know)
representation, rather a procedure (connection between simple entities),
because our brain has memory which allows us to have a processing model
 The goal is to predict the complexity of the system as an EMERGENT
PROPERTY → a complex behaviour carried out by simple elements
 Useful for COMPLEX PROBLEMS → we don’t have a representation of the problem space (no idea
on initial state etc), algorithmic solutions are too complex → we find an approximation of the
solution (shortcuts, the machine can figure it out by itself))

Can we "replicate" brain activity using artificial networks? 


The central nervous system (brain + spinal cord) is where the analysis of information takes place. This is
possible because of the neurons that populate it (electrically excitable cells that receive, process and
transmit information), which communicate by using electro-chemical signals (e.g. an electric potential
differentiating what’s inside & what’s outside the cell) via their synapses (= channels, open or closed: if a
certain level of activation (threshold) is reached, the signal is sent to the next cell; if not, the signal does
not pass from one neuron to the other). Their learning activity depends on the number of connections
that neurons establish with each other.
There are 2 kinds of connections and different types of neurons.
 Electrical: very quick, ions pass bi-directionally from one neuron to the other
 Chemical (most): neurotransmitters are released and attach to the membrane of the receiving
neuron if they are compatible with it → a slow and directional connection

Just as neurons are the basic components of the CNS, ARTIFICIAL NEURONS are the elementary units in an
artificial neural network; their interaction might be extremely complex (emergent property), almost brain-
like. The artificial neuron is a simple processing unit linked by weighted connections. It receives one or
more inputs and sums them to produce an output. Usually each input is separately weighted, and the sum
is passed through an activation function.
 a0 … an are independent input activations
 w are the weights, inhibitory or excitatory connections
 the net function is the weighted sum of all the activations (see the sketch below)
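A minimal sketch of such an artificial neuron: a weighted sum of the input activations passed through an activation function (a sigmoid here; a threshold/step function is another common choice):

```python
import math

def neuron(activations, weights, bias=0.0):
    """Weighted sum of the input activations, passed through a sigmoid activation."""
    net = sum(w * a for w, a in zip(weights, activations)) + bias   # net input
    return 1.0 / (1.0 + math.exp(-net))                             # activation function

# two inputs: an excitatory (positive) and an inhibitory (negative) connection
print(neuron([1.0, 1.0], [0.8, -0.3]))   # ≈ 0.62
```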

An ARTIFICIAL NEURAL NETWORK (ANN) is an


interconnected group of nodes, similar to the vast network
of neurons in a brain. Here, each circular node represents
an artificial neuron, and an arrow represents a connection
from the output of one artificial neuron to the input of
another.
We might have recurrent networks or feed-forward
networks (every time you leave a neuron you never go back), or even complex networks, a mix of the two.
 Such systems "learn" to perform tasks by considering examples, generally without being
programmed with any task-specific rule: they automatically generate identifying characteristics
from the learning material that they process.

(-------------CLASSIC ANN examples:


 PERCEPTRON: an algorithm for supervised learning of binary classifiers (= a function which
can decide whether an input, represented by a vector of numbers, belongs to some
specific class). It was initially conceived as a set of neurons connected to cells, the
goal of which was to recognise elements (pears or apples).
We only get 0 and 1 as input, the gradation of the input depending on the colour of the cell. We have
different weights (from 0 to 1). With supervised learning we inform the network of the correct solution
and of which weights are distributed in the wrong way, so that they get tuned into a
good approximation of the result (see the sketch after this list).
 PATTERN ASSOCIATOR: to each input corresponds one output
 KOHONEN MAP: unsupervised machine learning technique; a semantic network with
propagation of inputs without having anything in output: you just look at the propagation of
the network and compare if two concepts are similar or not.
 MULTILAYER NETWORKS: input layer, HIDDEN LAYER and output layer.
We analyse the HL (a set of nodes that receive the signal from the input layer with a certain
level of activation). We can calculate the distance between different vectors (the more
similar they are, the closer to 1; the more distant, the closer to 0); cluster analysis: once we
compare all the vectors of the hidden layer, we cluster together things that are more similar.
From a formal point of view, three layers should be able to solve any problem.
 RECURRENT NETWORKS: a hidden layer that makes a copy of itself;
they can handle arbitrary input/output lengths because of their internal memory.
Ideal for speech and text analysis
 CONVOLUTIONAL NETWORKS: independence of position and
size in recognition (face recognition). They add layer after
layer, restricting the size, to get pattern extraction; they use
connectivity patterns between their neurons and are inspired
by the organization of the animal visual cortex.
Ideal for image and video recognition
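A minimal sketch of the perceptron learning rule for supervised learning of a binary classifier: the weights are nudged whenever the prediction is wrong; the toy data (a logical AND, standing in for the apples/pears example) and the learning rate are assumptions:

```python
# toy training set: inputs are binary feature vectors, target is 0 or 1
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]   # logical AND, as a stand-in

weights, bias, eta = [0.0, 0.0], 0.0, 0.1   # eta = learning rate

def predict(x):
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net > 0 else 0              # threshold activation

for epoch in range(20):
    for x, target in data:
        error = target - predict(x)         # supervised signal: compare with the known answer
        # perceptron rule: tune the weights in proportion to the error
        weights = [w + eta * error * xi for w, xi in zip(weights, x)]
        bias += eta * error

print([predict(x) for x, _ in data])        # [0, 0, 0, 1] once the weights converge
```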

Two options to train an ANN


 SUPERVISED LEARNING: inform the network of which output should be produced, tell it whether its
output is wrong or correct, and tune the weights → not suitable to mimic learnability
 UNSUPERVISED LEARNING: implicit learning - no information on the given/expected output; we
hope it will minimize weight distribution to maximize efficiency  coherent with the idea that we
learn a language without a given/expected output
o HEBBIAN LEARNING → "Cells that fire together wire together", that is, simultaneous
activation of cells leads to pronounced increases in synaptic strength between those cells.
Useful for extracting regularities: Δw_ij = η·a_i·a_j
o BACKPROPAGATION → proportional redistribution of the errors backwards, layer by layer,
up to the input nodes (e.g. the NN gives 1 but the expectation was 0, so we redistribute the error
among the connections across the layers). Useful to train Multilayer Networks: Δw_ij = η·δ_ip·o_jp
(i = output neuron, j = input neuron, p = activation pattern, η = learning rate, δ_ip = measure of the error of i with
respect to pattern p, o_jp = coefficient granting that the error is proportional to the activation received from j;
see the sketch below)
 Learning = changing weights in our connection (no rewiring or anything)
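The two update rules above, written as single-step weight changes; the numerical values are purely illustrative:

```python
def hebbian_update(w_ij, a_i, a_j, eta=0.1):
    """Hebbian learning: Δw_ij = η · a_i · a_j (cells that fire together wire together)."""
    return w_ij + eta * a_i * a_j

def backprop_update(w_ij, delta_i, o_j, eta=0.1):
    """Backpropagation step: Δw_ij = η · δ_i · o_j
    (error of the receiving unit i, scaled by the activation sent by unit j)."""
    return w_ij + eta * delta_i * o_j

print(hebbian_update(0.5, 1.0, 1.0))    # 0.6: both units active, connection strengthened
print(backprop_update(0.5, -0.2, 0.8))  # 0.484: weight corrected against the error
```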

Learning is GRADIENT DESCENT: you have to minimize the error, finding a good balance in order to reach the
lowest possible error.
 LOCAL MINIMUM: the best possible solution given a certain context (not the optimal
solution); either you enlarge the modification span, or you remain in the local minimum. Local minima make
the problem very complex: sometimes the network does not learn because of them.

TIME in ANN → processing flow


 Epoch: one input processing (eg. given epoch T0, the next input T1 won’t be related with the
processing of input 0)
 Atemporal processing: activation only depends on input, connections and weights
 Temporal flow simulation trick: input divided in groups; each group a distinct temporal interval)

How do you code inputs and outputs with artificial neural networks?
The neural network is composed of a huge number of neurons capable of generating electrical pulses or
spikes at a great speed, and it seems reasonable to assume that the pattern of firing is used to code
information. How is the pattern activity used as a code?
 LOCALIST CODING: separate representations (=nodes) coding for distinct pieces of information, so
that 1 word = 1 node (e.g. 4 input units: a (0001) b (0010) c (0100) . (1000))
o Each unit is assigned a particular meaning: identifying “apple” would involve the activation
of neurons that code “apple”
o Typical of a more symbolic approach: cognition as formal manipulation of symbols with
explicit symbols to represent WORDS AND CONCEPTS
 DISTRIBUTED CODING: information is coded as a pattern of activation across many processing
units, with each unit contributing to many different representations: identifying “apple” requires
the activation of many neurons, each of which are also involved in the coding of other concepts
o using a binary coding, we can use 2 bits, for representing 4 elements (a, b, c, d) that is, 2
input neurons (a=00, b=01, c=10, d=11)
o Usually associated with connectionism and a more sub-symbolic approach: cognition does
not require explicit symbolic codes, bc information is distributed across the whole system
o Advantage: they require fewer units, but at a price: there is no clear way to code the simultaneous
presentation of two stimuli, since it is difficult to tell which features should be bound
together, and the same neuron firing for two different encodings may introduce a bias in the
network; in this case a localist coding is better in order not to have a bias (see the sketch below)
(One may also have cases in between, but the essential difference revolves around the question of whether
the activity of individual units is interpretable.)
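A tiny sketch contrasting a localist (one-hot, one unit per symbol) code with a compact distributed/binary code for the same four symbols of the example above:

```python
symbols = ["a", "b", "c", "."]

# localist coding: one unit per symbol, exactly one unit active
localist = {s: [1 if i == j else 0 for j in range(len(symbols))]
            for i, s in enumerate(symbols)}

# distributed (here: plain binary) coding: 2 units suffice for 4 symbols,
# and each unit participates in the code of several symbols
distributed = {s: [(i >> 1) & 1, i & 1] for i, s in enumerate(symbols)}

print(localist["a"], distributed["a"])   # [1, 0, 0, 0]  [0, 0]
print(localist["."], distributed["."])   # [0, 0, 0, 1]  [1, 1]
```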

What kind of linguistic phenomena have been studied using artificial neural network simulations?
 PAST TENSE (Rumelhart & McClelland, 86): we see a clear
linguistic pattern: 1) a few high-frequency verbs are learned in
crystallized forms (break), 2) an over-regularization phase (break →
breaked), 3) irregular verb inflection is learned, with a smooth
coexistence of irregular and over-regular forms until only
correct forms are used (break → broke, breaked → broke)

This pattern can be replicated in an ANN with a human-like performance.


Based on the idea of n-grams, we decompose the input
word in syllables/trigrams in a phonetic input coding
(Wickel-features), then we connect these with their
possible inflection, and we connect one-by-one the
expected output to relevant syllabic pattern. Tuning the
connections between phonetic features has been
achieved after 84.000 examples. At the beginning the
behaviour is random, then the network learns the very
frequent verbs (= phase 1 in children), then there is a
drop with exception, then it goes back to being adult- like.

However, there is a difference between the network and human learning: tuning in a supervised way
requires a lot of examples, whereas the child (apparently) does not have them: it only takes a couple of
exposures. We don’t need supervised learning to learn the past tense; this NN learns the past tense, but it
needs feedback on its performance.

What kind of artificial network architectures better fits linguistic needs?


Elman (1990): a multilayer network in which hidden layer activation is
copied on a CONTEXT LAYER that will be added to the next input
activation. The hidden layer from the previous time step provides a form
of memory, or context, that encodes earlier processing and informs the
decisions to be made at later points in time. No fixed-length limit on prior
context.
This is more psycho/neurologically plausible: priming effects (matching
an expectation, facilitation by coherent input to an expected output)
RECURRENT NEURAL NETWORKS/SIMPLE RECURRENT NETWORKS are networks that
contain a cycle within their network connections, meaning that the value of some unit
is dependent on its own earlier outputs as an input. The key difference from a
feedforward network lies in the recurrent link: the current hidden unit values can directly
affect the next state of the hidden units  memory for past inputs and ability to deal
with integrated sequences of inputs presented successively.

(SRN performance is independent of number of hidden units:


the size of hidden unit layer, when sufficiently large, does not influence the processing capability with
constructions of varying size.)
But how to decide what is useful in the input in order to
predict long-distance dependencies, like the ones we find
in human languages? The solution is Long Short-Term
Memory (LSTM): it is a solution to the problem of
vanishing gradients (if you try to back-propagate an error,
in time the error gets lower until it disappears; if it is 0 you
learn nothing); it provides us with a context able to store
something relevant and to forget something irrelevant.

It has the same architecture as an SRN, but the hidden layer is more complicated:
 The input goes through both the hidden layer and context
layer (where we store input so that it gets multiplied by the
hidden layer) – there you either produce something close to
1=remember or 0=forget
 At next step, you feed both the hidden layer and the context
layer
In this way we are able to keep track of long-distance dependencies.

How do you feed simple recurrent networks (SRN) and what do you expect as output?
SRN can deal with sentence processing if the input is revealed gradually over time, rather than being
presented at once. The input structure is localist (1 neuron = 1 input): the first token in the stream is
presented to the input layer. Activation is propagated forward. The target pattern for the output layer is
the next item in the stream of tokens: the output is compared to the target, delta terms are calculated,
and weights are updated before stepping along to the next item in the stream.
After processing the first item in the stream, the state of the hidden units is ‘copied’ to the context layer,
so that it becomes available as part of the input to the network on the next time step. (State-of-art
intuition: use hidden layer activation at next step)
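A minimal numpy sketch of one Elman/SRN step under the scheme just described: localist (one-hot) input, forward propagation to a next-word distribution, and the hidden state copied into the context layer; the weights are random and untrained, the vocabulary is a toy assumption, and no error/weight update is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "dog", "barks"]
V, H = len(vocab), 4                       # vocabulary size, number of hidden units

# weights: input->hidden, context->hidden, hidden->output (random, untrained)
W_xh = rng.normal(size=(H, V))
W_ch = rng.normal(size=(H, H))
W_hy = rng.normal(size=(V, H))

def one_hot(word):
    v = np.zeros(V); v[vocab.index(word)] = 1.0
    return v

def srn_step(word, context):
    """One time step: hidden = f(input + previous hidden copy); output = next-word scores."""
    hidden = np.tanh(W_xh @ one_hot(word) + W_ch @ context)
    scores = W_hy @ hidden
    probs = np.exp(scores) / np.exp(scores).sum()     # softmax over the vocabulary
    return probs, hidden                               # hidden is copied into the context layer

context = np.zeros(H)
for w in ["the", "dog"]:
    probs, context = srn_step(w, context)
    print(w, "->", {v: round(float(p), 2) for v, p in zip(vocab, probs)})   # next-token prediction
```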

(----------------SRNs are useful for discovering patterns in temporally extended data. They are trained to
associate inputs together with a memory of the last hidden layer state with output states. In this way, for
example, the network can predict what item will occur next in a stream of patterns.
 GUESSING NEXT WORD paradigm: the network has learnt to perform a task if it expects something
related to the input; in this way the network can learn a category implicitly
 e.g. the house is red (auto‐supervised learning)
Input = the
Output = house)

Can recursion be learned by SRN?


The recursive properties found in language are:
 Right-branching (not recursion proper, but iteration)
 Counting recursion (a^n b^n) → if… then… / either… or…
 Center embedding/mirror recursion (ww^R)
 Cross-serial dependencies/identity recursion (ww)
Christiansen & Chater (1999) propose a CONNECTIONIST MODEL to account for people’s limited ability to
process recursion without invoking the competence/performance distinction; human performance
emerges from intrinsic constraints on the performance of the SRN. They are not interested in modelling
unbounded recursion per se, but rather to model human performance.

The answer is yes. Let’s look at the demands of learning each of the recursive structures:
 Counting Recursion: should be easiest since it does not have agreement constraints imposed on it
 Centre embedding: develop a last-in-first-out memory/stack to store agreement information
 Cross-dependency: first-in-first-out memory/queue
 Right branching: does not involve unbounded memory load

i) Bach et al. (1986) found that cross-dependencies in Dutch were easier to process than centre-
embeddings in German. This is interesting since CDs cannot be captured by PSG rules and are typically
viewed as more complex because of this. → SRN performance fits human performance on similar
constructions & this tells us that the Chomsky hierarchy is not a good predictor of processing complexity
ii) Thomas (1997) suggests that some ungrammatical sentences involving doubly centre-embedded object
relative clauses may be perceived as grammatical → SRNs replicate human results
iii) processing difficulty increases with the depth of right recursion, although to a lesser degree
than for complex recursive constructions → replicated in SRNs
iv) counting recursion: Christiansen & Chater argue that such structures may not exist in natural
language; rather, they may be closer to center-embedded constructions – e.g. if-then pairs are nested in a
center-embedding order

How can a computer learn to solve NLP tasks without being explicitly programmed?
In general, while in symbolic approaches we have a system which we explicitly program with a set of rules
in order to do something, with MACHINE LEARNING we need no explicit programming. ML characterizes
any mechanical device which changes behaviour on the basis of experience to better perform in the future
 Data mining: from data (registration of facts) to information (patterns underlying the data); it is the
application of machine learning (algorithm)
 A computer program is said to learn from experience E with respect to some task T and some
performance measure P, if its performance on T, as measured by P, improves with experience E.

With ANNs, a network can eventually learn by itself how to solve NLP tasks without explicit teaching.
Elman (1993) gives an example of how this might be done, by making the network learn how to predict the
next word given previous context:
 COMPLEX REGIMEN: you start training the network on simple sentences and gradually
increase the number of complex sentences, so that in the last round of learning you only have
complex sentences.
It is efficient, able to generalize its performance to novel sentences.
However, it does not represent how human children learn: children DO start to learn the
simplest structures until they achieve the adult language, but they ARE NOT placed in an
environment in which they initially only encounter simple sentences; they hear samples of all
aspects of human language from the beginning
 LESIONING (modification of the memory of the network at every epoch): the child changes during the
period of learning: working memory and attention span are initially limited and increase over
time. We can model this in an ANN that is trained on complex data from the beginning, but whose
context units are reset to random values after every 2/3 words; the “working
memory” is then gradually extended until there are no more deletions.
We obtain just as good results with all the data available at the beginning; it allows the
network to focus on simpler/local relations first

Discuss the vectorial representation of a document (in our case study a picture description) adding extra
features representing hesitation, false starts and complexity cues in phrase structure
Assumption: words can be defined by their distribution in language use, that is their neighbouring words
or grammatical environments. We can use this idea to define meaning as a point in space based on
distribution: VECTORS are n-dimensional entities which can represent sentences in a measurable space,
which makes comparison possible.
We can plot not only frequencies (as in the bag-of-words approach), but also n-grams, POS
annotations, number of syllables, syntag_breaks (number of hesitations after a functional word),
repetitions (number of duplicated syllabic patterns) and false starts before content words. We can take these
features into account and see if they introduce an Information Gain. If so, we can plot them in a vector (see
the sketch below).
But where do we draw a line between classes? We can try to find the SUPPORT VECTORS (the data points
closest to the separating boundary) and maximize the margin between the classes, or rather use a DECISION
TREE, by using a feature and then, on its basis, expanding the tree considering other features until we get a
classification.
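A minimal sketch of such a document vector, combining bag-of-words counts with the extra features mentioned (hesitations after function words, repetitions, false starts); the function-word list, the "uh" hesitation marker and the "[false_start]" annotation are hypothetical conventions used only for illustration:

```python
from collections import Counter

FUNCTION_WORDS = {"the", "a", "of", "and", "to"}

def document_vector(tokens, vocabulary):
    """Bag-of-words counts plus a few extra fluency/complexity features."""
    counts = Counter(tokens)
    bow = [counts[w] for w in vocabulary]                 # frequency features
    syntag_breaks = sum(1 for prev, tok in zip(tokens, tokens[1:])
                        if prev in FUNCTION_WORDS and tok == "uh")   # hesitation after a function word
    repetitions = sum(1 for prev, tok in zip(tokens, tokens[1:]) if prev == tok)
    false_starts = tokens.count("[false_start]")          # assumes false starts are pre-annotated
    return bow + [syntag_breaks, repetitions, false_starts]

vocabulary = ["boy", "cookie", "jar", "mother"]
transcript = "the uh the boy takes a cookie from the the jar [false_start] mother".split()
print(document_vector(transcript, vocabulary))   # [1, 1, 1, 1, 1, 1, 1]
```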

Discuss the classification task based on bag-of-words approach


The BAG-OF-WORDS is a purely statistical approach: a text is represented as the multiset of its words,
disregarding grammar and even word order but just keeping frequencies. We have a vector composed of a
list of values, each corresponding to the frequency (or presence/absence) of a specific word/feature in the text.
Garrard et al. (2010) show that this approach, without any syntactic analysis, is already able to discriminate
between Semantic Dementia patients and controls.
However, sometimes bare frequencies may not be enough, because not all features contribute equally to
the discrimination task: the whole vector may be overfitted on the training data. A standard problem in
Machine Learning is in fact to learn by using the fewest possible features that can generalize to new data.
Another way to look at the problem is to try to find the most characteristic features for our task.
Information Gain (IG) may be useful here: the basic idea is that the more a feature contributes to
discrimination, the higher its IG; the more equally distributed the classes, the higher the total information
(IG = info_TOT − info_CODING, see the sketch below). By doing so, it has been found that syntag_breaks
(number of hesitations after a functional word) and syllabic repetitions are particularly relevant for the
discrimination task
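A minimal sketch of the Information Gain idea: entropy of the class labels minus the entropy left after splitting on a (binarized) feature; the labels and feature values are toy assumptions:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG = info(total) - info(after splitting on the feature)."""
    total = len(labels)
    split_info = 0.0
    for value in set(feature_values):
        subset = [l for l, v in zip(labels, feature_values) if v == value]
        split_info += (len(subset) / total) * entropy(subset)
    return entropy(labels) - split_info

labels = ["patient", "patient", "control", "control"]   # class of each document
many_breaks = [1, 1, 0, 0]     # e.g. "many syntagmatic breaks": perfectly discriminative
word_the = [1, 0, 1, 0]        # a frequency-based feature that does not discriminate

print(information_gain(labels, many_breaks))   # 1.0 bit
print(information_gain(labels, word_the))      # 0.0 bits
```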

Other extra topics


These are the "bonus lectures" we did not have time to discuss in class.
These topics won't be discussed at the oral exams, unless you don't want to bring them as a personal
follow-up.
 Machine Translation (see Chapter 25 of Jurafsky & Martin & Hutchins & Somers attached here)
 Language in the brain and its complexity (see Chesi & Moro 2014 attached here)
