https://www.cs.mun.ca/~harold/Courses/AI6001/Diary/index.html 1/33
2/6/24, 11:51 PM Artificial Intelligence 6001, Winter '24: Course Diary (Wareham)
The most common way of representing phones is to use special sound-alphabets, e.g., International
Phonetic Alphabet (IPA) [charts], ARPAbet [Wikipedia] (J&M, Table 7.1).
Each language has its own characteristic repertoire of sounds, and no natural language (English included) uses the full range of possible sounds.
Example: Xhosa tongue twisters (clicks)
Even within a language, sounds can vary depending on their context, due to constraints on the way
articulators can move in the vocal tract when progressing between adjacent sounds (both within and
between words) in an utterance.
Word-final [t]/[d] palatalization (J&M, Table 7.10), e.g., "set your", "not yet", "did you"
Word-final [t]/[d] deletion (J&M, Table 7.10), e.g., "find him", "and we", "draft the"
Variation in sounds can also depend on a variety of other factors, e.g., rate of speech, word frequency,
speaker's state of mind / gender / class / geographical location (J&M, Section 7.3.3).
As if all this weren't bad enough, it is often very difficult to isolate individual sounds and words from acoustic speech signals.
The Characteristics of Natural Language: Phonology (Bird (2003)) (Youtube)
Phonology is the study of the systematic and allowable ways in which sounds are realized and can occur
together in human languages.
Each language has its own set of semantically-indistinguishable sound-variants, e.g., [pat] and [p^hat]
are the same word in English but different words in Hindi. These semantically-indistinguishable variants
of a sound in a language are grouped together as phonemes.
Variation in how phonemes are realized as phones is often systematic, e.g., formation of plurals in
English:
Note that the phonetic form of the plural-morpheme /s/ is a function of the last sound (and in particular,
the voicing of the last sound) in the word being pluralized. A similar voicing of /s/ often (but not always)
occurs between vowels, e.g., "Stasi" vs. "Streisand".
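The voicing-dependent choice of plural allomorph can be sketched as a small function. This is a minimal illustration, assuming a crude string-based phone representation (not real IPA) and deliberately partial phone sets:

```python
# Sketch of English plural allomorphy: the surface form of the plural
# morpheme /s/ depends on the final phone of the noun.
SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}
VOICED = {"b", "d", "g", "v", "m", "n", "l", "r", "w",
          "a", "e", "i", "o", "u"}  # vowels + some voiced consonants (partial)

def plural_allomorph(phones):
    """Return the surface plural allomorph for a noun given as a phone list."""
    last = phones[-1]
    if last in SIBILANTS:
        return "iz"   # e.g., "dishes"
    if last in VOICED:
        return "z"    # e.g., "pods"
    return "s"        # e.g., "pots"

print(plural_allomorph(["p", "o", "t"]))   # s
print(plural_allomorph(["p", "o", "d"]))   # z
print(plural_allomorph(["d", "i", "sh"]))  # iz
```

Note that the sibilant check must come first: sibilants like [s] are themselves voiceless but still trigger the epenthesized [iz] form.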
Such systematicity may involve non-adjacent sounds, e.g., vowel harmony in the formation of plurals in
Turkish (Kenstowicz (1994), p. 25):
The form of the vowel in the plural-morpheme /lar/ is a function of the vowel in the word being
pluralized.
It is tempting to think that such variation is encoded in the lexicon, such that each possible form of a word and its pronunciation are stored. However, such variation is typically productive with respect to new words (e.g., nonsense words, loanwords), which suggests that variation is instead the result of processes that transform abstract underlying lexical forms into concrete surface pronounced forms.
Plural of nonsense words in English, e.g., blicket / blickets, farg / fargs, klis / klisses.
Unlimited concatenated-morpheme vowel harmony in Turkish (Sproat (1992), p. 44):
çöplüklerimizdekilerdenmiydi =>
çöp + lük + ler + imiz + de + ki + ler + den + mi + y + di
"was it from those that were in our garbage cans?"
In Turkish, complete utterances can consist of a single word in which the subject of the utterance is a root-morpheme (in this case, [çöp], "garbage") and all other (usually syntactic, in languages like English) relations are indicated by suffix-morphemes. As noted above, the vowels in the suffix-morphemes are all subject to vowel harmony. Given the in-principle unbounded number of possible word-utterances in Turkish, it is impossible to store them (let alone their versions as modified by vowel harmony) in a lexicon.
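The plural-suffix harmony above can be sketched as a process over stems. This is a simplified orthographic illustration: it handles only the two-way back/front split governing -lar/-ler, not the four-way harmony of Turkish high-vowel suffixes:

```python
# Sketch of Turkish plural vowel harmony: the suffix surfaces as -lar
# after a back vowel and -ler after a front vowel in the stem.
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")

def turkish_plural(stem):
    """Attach the harmonizing plural suffix, keyed to the stem's last vowel."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return stem + "lar"
        if ch in FRONT_VOWELS:
            return stem + "ler"
    raise ValueError("no vowel found in stem")

print(turkish_plural("çöp"))    # çöpler ("garbage cans")
print(turkish_plural("kitap"))  # kitaplar ("books")
```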
dosutoru "duster"
sutoroberri "strawberry"
kurippaa "clippers"
sutoraiki "strike"
katsuretsu "cutlet"
parusu "pulse"
gurafu "graph"
Japanese has a single phoneme /r/ for the liquid-phones [l] and [r]; moreover, it allows only very restricted types of multi-consonant clusters. Hence, when words that violate these constraints are borrowed from another language, those words are changed by modifying [l] to [r] (or deleting it) and by inserting vowels to break up invalid multi-consonant clusters.
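These adaptations can be sketched as a toy process over a crude phone list. The model below is a deliberate simplification: real adaptation operates on phones, and the choice of epenthetic vowel is richer than the "u generally, o after [t]/[d]" rule assumed here:

```python
# Toy sketch of Japanese loanword adaptation: merge /l/ into [r], then
# insert an epenthetic vowel after any consonant not followed by a vowel.
VOWELS = {"a", "e", "i", "o", "u"}

def adapt(phones):
    """Adapt a crude phone list to (simplified) Japanese phonotactics."""
    phones = ["r" if p == "l" else p for p in phones]  # /l/ -> [r]
    out = []
    for i, p in enumerate(phones):
        out.append(p)
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if p not in VOWELS and (nxt is None or nxt not in VOWELS):
            out.append("o" if p in {"t", "d"} else "u")  # epenthesis
    return "".join(out)

print(adapt(["g", "r", "a", "f"]))  # gurafu  ("graph")
print(adapt(["p", "a", "l", "s"]))  # parusu  ("pulse")
```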
Each language has its own phonology, which consists of the phonemes of that language, constraints on
allowable sequences of phonemes (phonotactics), and descriptions of how phonemes are instantiated as
phones in particular contexts. The systematicities within a language's phonology must be consistent with
(but are by no means limited to those dictated solely by) vocal-tract articulator physics, e.g., consonant
clusters in Japanese.
Courtesy of underlying phonological representations and processes, one is never sure if an observed
phonetic representation corresponds directly to the underlying representation or is some process-
mediated modification of that representation, e.g., we are never sure if an observed phone corresponds to
an "identical" phoneme in the underlying representation. This introduces one type of ambiguity into
natural language processing.
Morphology is the study of the systematic and allowable ways in which symbol-string/meaning pieces
are organized into words in human languages.
A symbol-string/meaning-piece is called a morpheme.
Two broad classes of morphemes: roots and affixes.
A root is a morpheme that can exist alone as word or is the meaning-core of a word, e.g., tree
(noun), walk (verb), bright (adjective).
An affix is a morpheme that cannot exist independently as a word, and only appears in language
as part of word, e.g., -s (plural), -ed (past tense; 3rd person), -ness (nominalizer).
A word is essentially a root combined with zero or more affixes. Depending on the type of root, the
affixes perform particular functions, e.g., affixes mark plurals in nouns and subject number and tense in
verbs in English.
Morphemes are language-specific and are stored in a language's lexicon. The morphology of a language
consists of a lexicon and a specification of how morphemes are combined to form words
(morphotactics).
Morpheme order typically matters, e.g., uncommonly, *commonunly, *unlycommon (English)
There are a number of ways in which roots and affixes can be combined in human languages (Trost
(2003), Sections 2.4.2 and 2.4.3):
Prefix: An affix attached to the front of the root, e.g., the negative markers un-/in-/im- for adjectives in English (uncommon, infeasible, immature).
Suffix: An affix attached to the back of the root, e.g., the plural marker -s for nouns in English (pots, pods, dishes).
Circumfix: A prefix-suffix pair that must both attach to the root, e.g., the past participle marker ge-/-t for verbs in German (gesagt "said", gelaufen "ran").
Infix: An affix inserted at a specific position in a root, e.g., the -um- verbalizer for nouns and
adjectives in Bontoc (Philippines):
Template Infix: An affix consisting of a sequence of elements that are inserted at specific
positions into a root (root-and-template morphology), e.g., active and passive markers -a-a- and
-u-i- for the root ktb ("write") in Arabic:
Reduplication: An affix consisting of a whole or partial copy of the root that can be prefix, infix,
or suffix to the root, e.g., formation of the habitual-repetitive in Javanese:
As with phonological variation, there are several lines of evidence which suggest that morphological
variation is not purely stored in the lexicon but rather the result of processes operating on underlying
forms:
Productive morphological combination simulating complete utterances in words, e.g., Turkish
example above.
Morphology operating over new words in a language, e.g., blicket -> blickets, television ->
televise -> televised / televising, barbacoa (Spanish) -> barbecue (English) -> barbecues.
As if all this didn't make things difficult enough, different morphemes need not have different surface forms, e.g., variants of "book"
Courtesy of phonological transformations operating both within and between morphemes and the non-
uniqueness of surface forms noted above, one is never sure if an observed surface representation
corresponds directly to the underlying representation or is a modification of that representation, e.g., is
[blints] the plural of "blint" or does it refer to a traditional Jewish cheese-stuffed pancake? This
introduces another type of ambiguity into natural language processing.
The Characteristics of Natural Language: Syntax (BKL, Section 8; J&M, Chapter 12; Kaplan (2003))
Syntax is the study of the systematic and allowable ways in which words are organized into utterances
(spoken) and sentences (written) in human languages.
Two fundamental facts of natural language syntax (J&M, pp. 385-386):
constituency: a group of one or more (most often, adjacent) words may behave as a single unit
called a constituent, e.g., "fox", "dog", "jumped", "the quick brown fox", "the lazy dog", and
"jumped over the lazy dog" are constituents in the utterance "the quick brown fox jumped over the
lazy dog".
grammatical relations: constituents relate in systematic manners to other constituents in an utterance, e.g., "the quick brown fox" and "the lazy dog" are the subject and object, respectively, in the utterance "The quick brown fox jumped over the lazy dog".
Constituents typically have associated grammatical classes whose members may have different
structures but perform equivalent grammatical functions, e.g., noun phrase, determiner, verb phrase.
There are typically multiple hierarchical levels of constituents in an utterance, e.g.,
Some languages encode grammatical relations primarily by word order, e.g., "John visited Mary"
(English); others primarily use morphology, e.g., "John-ga Mary-o tazune-ta", "Mary-o John-ga tazune-
ta" (Japanese) (Kaplan (2003), p. 71).
Grammatical relations can hold over arbitrary distances in an utterance, e.g., "The man who knew the soldier that killed the sailor who sailed as first mate on the S.S. Warsaw and visited my ex-sister-in-law Louise's best friend in Paris last month died." (garden-path utterances)
Grammatical relations may even be recursively embedded, e.g., "The cat the dog the rat the elephant
admired bit chased likes tuna fish." (J&M, p. 536)
Syntax is language-specific. Building on a language's morphology, the syntax is a specification called a
grammar of how words are combined to form utterances in that language.
Analogous to phonological and morphological variation, the fact that syntax operates productively to
generate a potentially infinite number of utterances over a finite lexicon suggests that syntactic patterns
are not purely stored in the grammar but are rather the result of grammatical processes operating on
underlying forms.
As if all this didn't make things difficult enough, utterances expressing very different meanings need not
have different surface forms, e.g.,
Hence, ambiguity inherent in grammars and lexicons adds yet another type of ambiguity into natural
language processing (BKL, pp. 317-318).
The Characteristics of Natural Language: Semantics, Discourse, and Pragmatics (Lappin (2003); Leech and
Weisser (2003); Ramsay (2003))
Semantics is the study of the manner in which meaning is associated with utterances in human language; discourse and pragmatics focus, respectively, on how meaning is maintained and modified over the course of multi-person dialogues and on how these people choose different individual-utterance and dialogue styles to communicate effectively.
Meaning seems to be very closely related to syntactic structure in individual utterances; however, the
meaning of an utterance can vary dramatically depending on the spatio-temporal nature of the discourse
and the goals of the communicators, e.g., "It's cold outside." (statement of fact spoken in Hawaii;
statement of fact spoken on the International Space Station; implicit order to close window).
Various mechanisms are used to maintain and direct focus within an ongoing discourse (Ramsay (2003),
Section 6.4):
Different syntactic variants with subtly different meanings, e.g., "Ralph stole my bike" vs "My
bike was stolen by Ralph".
Different utterance intonation-emphasis, e.g., "I didn't steal your bike" vs "I didn't steal your
BIKE" vs "I didn't steal YOUR bike" vs "I didn't STEAL your bike".
Syntactic variants which presuppose a particular premise, e.g., "How long have you been beating
your wife?"
Syntactic variants which imply something by not explicitly mentioning it, e.g., "Some people left
the party at midnight" (-> and some of them didn't), "I believe that she loves me" (-> but I'm not
sure that she does).
Another mechanism for structuring discourse is to use references (anaphora) to previously discussed
entities (Mitkov (2003a)).
There are many kinds of anaphora (Mitkov (2003a), Section 14.1.2):
Pronominal anaphora, e.g., "A knee jerked between Ralph's legs and he fell sideways
busying himself with his pain as the fight rolled over him."
Adverb anaphora, e.g., "We shall go to McDonald's and meet you there."
Zero anaphora, e.g., "Amy looked at her test score but was disappointed with the results."
Though a convenient conversational shorthand, anaphora can be (if not carefully used)
ambiguous, e.g.,
"The man stared at the male wolf. He salivated at the thought of his next meal."
"Place part A on Assembly B. Slide it to the right."
"Put the doohickey by the whatchamacallit over there."
As demonstrated above, utterance meaning depends not only on the individual utterances but also on the context in which those utterances occur (including knowledge of both the past utterances in a dialogue and the possibly unknown and dynamic goals and knowledge of all participants in the discourse), which adds yet another layer of ambiguity into natural language processing ...
It seems then that natural language, by virtue of its structure and use, encodes both a lot of ambiguity and
variation, as well as a wide variety of structures at all levels. This causes massive problems for artificial NLP
systems, but humans handle it with surprising ease.
Example: Prototypical natural language acquisition device
Given all the characteristics of natural language discussed in the previous lectures and their explicit and
implied constraints on what an NLP system must do, what then are appropriate computational mechanisms for
implementing NLP systems? It is appropriate to consider first what linguists, the folk who have been studying
natural language for the longest time, have to say on this matter.
NLP Mechanisms: The View from Linguistics
Given that linguistic signals are expressed as temporal (acoustic and signed speech) and spatial (written text) sequences of elements, there must be a way of representing such sequences.
Sequence elements can be atomic (e.g., symbols) or have their own internal structure (e.g., feature
matrices, form-meaning bundles (morphemes)); for simplicity, assume for now that elements are
symbols.
There are at least two types of such sequences representing underlying and surface forms.
Where necessary, hierarchical levels of structure such as syntactic parse trees can be encoded as
sequences by using appropriate interpolated and nested brackets, e.g., "[[the quick brown fox]
[[jumped over] [the lazy brown dog]]]"
The existence of lexicons implies mechanisms for representing sets of element-sequences, as well as
accessing and modifying the members of those sets.
The various processes operating between underlying and surface forms presuppose mechanisms
implementing those processes.
The most popular implementation of processes is as rules that specify transformations of one form
to another form, e.g., add voice to the noun-final plural morpheme /s/ if the last sound in the noun
is voiced.
Rules are applied in a specified order to transform an underlying form to its associated surface
form for utterance production, and in reverse fashion to transform a surface form to its associated
underlying form(s) for utterance comprehension (ambiguity creating the possibility of multiple
such associated forms).
Rules can be viewed as functions. Given the various types of ambiguity we have seen, these functions are at least one-to-many.
Each type of process (e.g., phonology, morphology, syntax, semantics) can be seen as a separate function, with the functions applied consecutively; alternatively, natural language can be seen as a single function that is the composition of all of these individual functions. The latter is the view taken by many neural network implementations of NLP (which will be discussed later in this module).
Great care must be taken in establishing what parameters are necessary as input to these functions and
that such parameters are available. For example, a syntax function can get by with a morphological analysis
of an utterance, but a semantics function would seem to require as input not only possible syntactic analyses of
an utterance but also discourse context and models of discourse participant knowledge and intentions.
Types of FSA:
Deterministic (DFA): At each state and for each symbol in the alphabet, there is at most one
transition from that state labeled with that symbol.
Non-Deterministic (NFA): At each state, there may be more than one transition from that state
labeled with a particular symbol and/or there may be transitions labeled with special symbol
epsilon (denoted in our diagrams by symbol E).
Revised notion of string acceptance: see if there is any path through the NFA that accepts
the input string.
Example #2: An NFA for the set of all strings over the alphabet {a,b} that start with an a and end with a
b (uses multiple same-label outward transitions).
DFA can recognize strings in time linear in the length of the input string, but may not be compact; NFA
are compact but may require time exponential in the length of a string to recognize that string (need to
follow all possible computational paths to check string acceptance).
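The exponential blow-up can be avoided in practice by tracking the *set* of states the NFA could currently be in rather than following individual paths. Here is a sketch of that simulation; the transition table encodes an NFA (states and transitions are illustrative, not the course's diagram) for strings over {a,b} that start with a and end with b, with epsilon written as the empty string:

```python
# NFA acceptance by subset tracking: maintain all reachable states at once.
def eps_closure(states, delta):
    """All states reachable from `states` via epsilon ("") transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def nfa_accepts(string, delta, start, finals):
    current = eps_closure({start}, delta)
    for sym in string:
        nxt = set()
        for q in current:
            nxt |= delta.get((q, sym), set())
        current = eps_closure(nxt, delta)
    return bool(current & finals)

# NFA for strings over {a,b} that start with a and end with b
delta = {(0, "a"): {1}, (1, "a"): {1}, (1, "b"): {1, 2}}
print(nfa_accepts("ab", delta, 0, {2}))  # True
print(nfa_accepts("ba", delta, 0, {2}))  # False
```

This subset simulation runs in time polynomial in the string length; the exponential cost mentioned above arises only when paths are followed one at a time with backtracking.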
Example #3: An NFA for the set of all strings over the alphabet {a,b} that start with an a and end with a
b (uses epsilon-transitions).
Oddly enough, deterministic and non-deterministic FSA are equivalent in recognition power, i.e., there
is a DFA for a particular set of strings iff there is an NFA for that set of strings.
Application: Building lexicons
The simplest types of lexicons are lists of morphemes which, for each morpheme, store the
surface form and syntactic / semantic information associated with that morpheme.
A list of surface forms can be stored as FSA in two ways:
DFA encoding of a trie (= retrieval tree)
Deterministic acyclic finite automaton (DAFA)
One creates a trie by compacting a word-list and a DAFA for a word-list by compacting a trie-
DFA for that word-list.
Example #4: A trie-DFA encoding the lexicon {"in", "inline", "input", "our", "out", "outline",
"output"}.
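A trie over the example lexicon can be sketched with nested dictionaries, treating each node as a DFA state and marking final states where words end (the "$" end-marker is an implementation convenience, not part of the course's diagram):

```python
# Minimal trie (retrieval tree) sketch: nodes are dicts, "$" marks finals.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # word ends here: final state
    return root

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

lexicon = build_trie(["in", "inline", "input", "our", "out", "outline", "output"])
print(lookup(lexicon, "inline"))  # True
print(lookup(lexicon, "inl"))     # False (a prefix, but not a word)
```

Lookup time is linear in the length of the query word, independent of lexicon size; the DAFA of Example #5 further merges shared suffixes (e.g., the "-line" and "-put" tails) to save states.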
Example #5: A DAFA encoding the lexicon {"in", "inline", "input", "our", "out", "outline",
"output"}.
The simplest types of purely concatenative morphotactics can be encoded in a finite-state manner by indicating, for each morpheme-type sublexicon, which types of morphemes can precede and follow morphemes in that sublexicon, e.g., nouns precede pluralization-morphemes in English.
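Such concatenative morphotactics can be sketched directly: a word is accepted if it decomposes into a noun root optionally followed by a plural morpheme. The tiny sublexicons below are illustrative stand-ins for real sublexicon FSA chained together:

```python
# Sketch of concatenative morphotactics: noun root (+ optional plural).
NOUNS = {"pot", "pod", "dish"}
PLURAL = {"s", "es"}

def accepts(word):
    if word in NOUNS:  # bare root is a word
        return True
    for n in NOUNS:    # root followed by a plural morpheme
        if word.startswith(n) and word[len(n):] in PLURAL:
            return True
    return False

print(accepts("pots"))   # True
print(accepts("spots"))  # False (no noun-root prefix in this lexicon)
```

In a full finite-state implementation this check corresponds to gluing the accept states of the noun sublexicon FSA to the start state of the plural-morpheme FSA.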
More complex morphotactics, e.g., circumfixes, infixes, reduplication, can be layered on top of
lexicon FSA using additional mechanisms (Beesley and Karttunen (2000); Beesley and Karttunen
(2003), Chapter 8; Kiraz (2000)).
NLP Mechanisms: Finite-state transducers (FST) (J&M, Section 3.4, Mohri (1997); Roche and Schabes
(1997a), Section 1.3)
Generalization of FSA in which each transition is labeled with a symbol-pair (either or both of which
may be the special symbol epsilon).
The first symbol in each pair is called the lower symbol and the second is called the upper symbol.
The alphabet of all lower symbols in an FST need not overlap with the alphabet of all upper symbols.
Example #1: An FST for transforming between the set of all strings over the alphabet {a,b} that consist
of n, n >= 0, concatenated copies of ab and the set of all strings over the alphabet {c,d} that consist of n,
n >= 0, concatenated copies of cd.
Operation: generation / recognition of string-pairs, reconstruction of upper (lower) string associated with
given lower (upper) string.
Generation of a string-pair builds that pair from left to right, adding symbol-pairs to the right-hand
end of the string-pair as one progresses along a transition-path from the start state to a final state.
Recognition of a string-pair proceeds by stepping through that pair from left to right, deleting
symbol-pairs from the left-hand end of the string-pair as one progresses along a transition-path
from the start state to a final state.
Reconstruction of a string-pair associated with a given string builds the missing string of the pair
from left to right, adding missing-string symbols to the right-hand end of the missing string as one
progresses along a transition-path from the start state to a final state in accordance with the given
string.
As the given string may be either the lower or upper string, there are two types of string-pair
reconstructions.
Each reconstructed string-pair corresponds to a particular path through the FST guided by
the given string.
Depending on the structure of the FST, there may be more than one path through the FST
consistent with the given string, and hence more than one string-pair reconstruction.
Note that one need only keep track of one FST-state during all operations described above,
namely, the last state visited.
Unlike FSA, there are many types of FST and many types of FST (non-)determinism.
Example #2: An FST for transforming between the set of all strings over the alphabet {a,b} that consist
of n, n >= 0, concatenated copies of ab and the set of all strings over the alphabet {a,b} that consist of n,
n >= 0, concatenated copies of aba.
Example #3: An FST implementing the rule that voice be added to the noun-final plural morpheme /s/ if
the last sound in the noun is voiced.
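Reconstruction of the upper string from a given lower string can be sketched for Example #1 (pairing (ab)^n with (cd)^n). The sketch below assumes a deterministic FST, so a single path (and hence a single tracked state, as noted above) suffices; non-deterministic FSTs would require a search over paths:

```python
# Sketch of FST upper-string reconstruction for Example #1:
# transitions map (state, lower_symbol) to (next_state, upper_symbol).
delta = {(0, "a"): (1, "c"), (1, "b"): (0, "d")}
START, FINALS = 0, {0}

def transduce_up(lower):
    """Rebuild the upper string paired with `lower`, or None if no
    accepting path exists."""
    state, upper = START, []
    for sym in lower:
        if (state, sym) not in delta:
            return None
        state, out = delta[(state, sym)]
        upper.append(out)
    return "".join(upper) if state in FINALS else None

print(transduce_up("abab"))  # cdcd
print(transduce_up("aba"))   # None (path ends in a non-final state)
```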
Example #2: A CFG for an infinite-size subset of the set of all recursively-embedded utterances.
Tuesday, January 30
In-class Exam Notes
I've finished making up the in-class exam. The exam will be closed-book and written in-class on paper (please
bring your own pens). It will be 50 minutes long and has a total of 50 marks (this is not coincidental; I have
tried to make the number of marks for a question approximately equal to the number of minutes it should take
you to do it). The exam will cover material in all Module 2 lectures. There will be two questions, each with
multiple parts. The distribution of question-parts and marks by topic are as follows:
I hope the above helps, and I wish you all the best of luck with this exam.
Unlike the other sample grammars above, this grammar is structurally ambiguous because there may be more than one parse for certain utterances involving prepositional phrases (Pp), as it is not obvious which noun phrase (Np) a Pp is attached to, e.g., in "the dog saw the man in the park", is the man (above) or the dog (below) in the park?
Note that the parse trees encode grammatical relations between entities in the utterances, and that these
relations have associated semantics; hence, one can use parse trees as encodings of basic utterance
meaning!
As shown in Example #3 above, parse trees via structural ambiguity can nicely encode semantic
ambiguity.
Context-free mechanisms can handle many complex linguistic phenomena like unbounded numbers of recursively-embedded long-distance dependencies. This is done by parsing, which both recognizes a sentence and assigns it an internal hierarchical constituent phrase-structure.
Parsing a sentence S with n words relative to a grammar G can be seen as a search over all possible parse trees that can be generated by grammar G, with the goal of finding all parse trees whose leaves are labelled with exactly the words in S in an order that (depending on the language) either exactly matches or is consistent with that in S.
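One classic way to organize this search is the CYK algorithm, which fills a table of grammar symbols deriving each span of the sentence. The sketch below recognizes sentences relative to a toy grammar in Chomsky Normal Form; the grammar and words are illustrative, not from the course's sample grammars:

```python
# Sketch of CFG recognition via CYK over a toy grammar in CNF.
grammar = {            # lhs -> (B, C) binary rules, or "word" lexical rules
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP")],
    "Det": ["the"],
    "N": ["dog", "man"],
    "V": ["saw"],
}

def cyk_recognize(words, grammar, start="S"):
    n = len(words)
    # table[i][j] = set of nonterminals deriving words[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):          # lexical rules fill the diagonal
        for lhs, rhss in grammar.items():
            if w in rhss:
                table[i][i].add(lhs)
    for span in range(2, n + 1):           # combine adjacent smaller spans
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for lhs, rhss in grammar.items():
                    for rhs in rhss:
                        if (isinstance(rhs, tuple)
                                and rhs[0] in table[i][k]
                                and rhs[1] in table[k + 1][j]):
                            table[i][j].add(lhs)
    return start in table[0][n - 1]

print(cyk_recognize("the dog saw the man".split(), grammar))  # True
```

Recognition takes O(n^3) time in this scheme even when the number of distinct parse trees is exponential, which is precisely why enumerating all trees (rather than recognizing) is the expensive step discussed below.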
Though some algorithms generate one valid parse for a given sentence much more quickly than others, for all parsing algorithms the time needed to generate all valid parse trees (which is required in the simplest human-guided schemes for resolving sentence ambiguities) is in the worst case at least proportional to the number of valid parse trees for the given sentence S relative to the given grammar G.
There are natural grammars for which this quantity is exponential in the number of words in the given sentence (BKL, p. 317; Carpenter (2003), Section 9.4.2).
Dealing with very large numbers of output derivation-trees is problematic. What can be done? One option is to use probability -- that is, to compute only the derivation-tree with the highest probability for the given utterance!
This is a general way to handle the many-results problem with ambiguous linguistic processes. There are
many models of probabilistic computing with context-free grammars and finite-state automata and
transducers, and many of these were state-of-the-art for NLP for several decades. However, as Li (2022)
points out, the more purely mathematical probabilistic modeling of natural language started over a
century ago with Markov Models is actually the research tradition that leads to the final NLP
implementation mechanism we shall examine -- namely, (artificial) neural networks.
NLP Mechanisms: Neural Networks (Kedia and Rasu (2020), Chapters 8-11)
The finite-state and context-free NLP mechanisms we have studied thus far in the module are all
process-oriented, as they all implement rules (explicitly in the case of grammars, implicitly via the state-
transitions in automata and transducers) that correspond to processes postulated by linguists on the basis
of observed human utterances.
Neural network (NN) NLP mechanisms, in contrast, are function-oriented, in that the components of these mechanisms typically do not have linguist-postulated correlates and the focus is instead on inferring (with the assistance of massive amounts of observed human utterances) various functions associated with NLP.
These functions map linguistic sequences onto categories, e.g., spam e-mail detection, sentiment
analysis, or other linguistic sequences, e.g., Part-of-Speech (PoS) tagging, chatbot responses,
machine translation between human languages.
In many cases, these linguistic sequences are recodings of human utterances as multi-dimensional numerical vectors or matrices. There are a variety of methods for creating these vectors and matrices, e.g., bag of words, n-grams, word / sentence / document embeddings (Kedia and Rasu (2020), Chapters 4-6; Vajjala et al (2020), Chapter 3).
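The simplest such recoding, bag of words, can be sketched in a few lines: each document becomes a count vector over a shared vocabulary. This is a minimal illustration; real pipelines add tokenization, case-folding, stop-word handling, and so on:

```python
# Minimal bag-of-words sketch: documents become count vectors over a
# sorted vocabulary built from the whole corpus.
def bag_of_words(docs):
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w in d.split():
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors

vocab, vecs = bag_of_words(["the dog saw the man", "the man ran"])
print(vocab)    # ['dog', 'man', 'ran', 'saw', 'the']
print(vecs[0])  # [1, 1, 0, 1, 2]
```

Note that word order is discarded entirely; n-grams and embeddings are successive attempts to put structural and semantic information back into the vectors.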
First Wave (1943-1969): McCulloch and Pitts propose abstract neurons in 1943. Starting in the late 1950s, Rosenblatt explores the possibilities for representing and learning functions relative to single abstract neurons (which he calls perceptrons). The mathematical principles underlying the backpropagation procedure for training neural networks are developed in the early 1960s (Section 5.5, Schmidhuber (2015)). Perceptron research is killed off by the publication in 1969 of Minsky and Papert's monograph Perceptrons, in which they show by rigorous mathematical proof that perceptrons are incapable of representing many basic mathematical functions such as Exclusive OR (XOR).
Second Wave (1980-1990): Rumelhart, McClelland, and colleagues propose and explore the
possibilities for multi-level feed-forward neural networks incorporating hidden layers of artificial
neurons. This is aided immensely by the (re)discovery of the backpropagation procedure, which
allows efficient learning of arbitrary functions by multi-layer neural networks. Though it is shown
that these networks are powerful (Universal Approximation Theorems state that any well-behaved
function can be approximated to an arbitrarily close degree by a neural network with only one
hidden layer), research remains academic as backpropagation on even small to moderate-size
networks is too data- and computation-intensive.
It is during this period that NN-based NLP research flourishes, taking up over a third of the
second volume of the 1987 summary work Parallel Distributed Processing (PDP).
Alternatives to feed-forward NN (Recurrent Neural Networks (see below)) for NLP are also
explored by Elman starting in 1990.
Third Wave (2000-now): With the availability of massive amounts of data and computational
power courtesy of the Internet and Moore's Law (as instantiated in special-purpose processors like
GPUs), neural network research re-ignites in a number of areas, starting with image processing
and computer vision. This is aided by the development of neural networks incorporating special
structures inspired by structures in the human brain, e.g., Long Short-Term Memory cells (see
below). Starting around 2010, this wave reaches NLP; the results are so spectacular that by 2018,
NN-based NLP techniques are state of the art in many applications.
This is in large part because the more complex types of neural networks have enabled the
creation of pre-trained NLP models that can subsequently be customized with relatively
little training data for particular applications, e.g., question answering systems (Li (2022)).
Let us now explore this research specific to NLP by examining the various types of neural networks that
have emerged during these three waves.
Feed-forward Multi-layer Neural Networks (FF-NN) (Kedia and Rasu (2020), Chapter 8)
In FF-NN, each neuron has n inputs x1, x2, ..., xn, input-specific weights w1, w2, ..., wn, a bias-term b,
an activation function f(), and an output y = f((x1 * w1) + (x2 * w2) + .... + (xn * wn) + b).
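The neuron computation above can be sketched directly as y = f(w·x + b); the weights, bias, and sigmoid activation below are chosen purely for illustration:

```python
import math

# Single artificial neuron: weighted sum of inputs plus bias, passed
# through an activation function f.
def neuron(xs, ws, b, f):
    return f(sum(x * w for x, w in zip(xs, ws)) + b)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

y = neuron([1.0, 0.0], [2.0, -1.0], -2.0, sigmoid)
print(y)  # 0.5, since (1*2) + (0*-1) + (-2) = 0 and sigmoid(0) = 0.5
```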
In a multi-layer FF-NN, there is an input layer consisting of n inputs, one or more hidden layers of
possibly different numbers of artificial neurons, and an output layer consisting of one or more
artificial neurons. Each layer is fully connected to the next, and there are no connections between pairs
of neurons in the same layer.
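A forward pass through such a network is just the single-neuron computation applied layer by layer; a minimal sketch (layer sizes and all weight values below are invented):

```python
# Forward pass through a fully connected multi-layer FF-NN.
# Each layer is a (weights, biases) pair; weights[j] holds the incoming
# weights of neuron j in that layer.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(xs, weights, biases):
    return [sigmoid(sum(x * w for x, w in zip(xs, ws)) + b)
            for ws, b in zip(weights, biases)]

def network_forward(xs, layers):
    for weights, biases in layers:
        xs = layer_forward(xs, weights, biases)
    return xs

# Invented example: 2 inputs -> hidden layer of 3 neurons -> 1 output neuron.
layers = [
    ([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]], [0.0, 0.1, -0.1]),
    ([[0.5, -0.5, 0.2]], [0.05]),
]
out = network_forward([1.0, 0.0], layers)
```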
Once backpropagation has converged on good weight and bias values for the FF-NN relative to the
training set, the FF-NN is typically evaluated relative to a separate test set to see how well the FF-NN's
behaviour generalizes.
FF-NN may exhibit underfitting (performance is bad on both the training and test sets) or
overfitting (performance is good on the training set but bad on the test set, i.e., the FF-NN's
behaviour does not generalize).
Underfitting is dealt with by modifications to the FF-NN's architecture, e.g., number of hidden
layers, number of neurons per layer.
There are a variety of techniques for dealing with overfitting, e.g.,
more training data (add more input-output pairs to the training set).
regularization (add a term to penalize too good a fit between the generated and wanted
output).
early stopping (stop training before weight- and bias-value convergence has occurred).
dropout (remove the effects of some percentage of randomly-selected neurons in the FF-
NN in each training run).
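Of these techniques, dropout is the easiest to sketch in a few lines (this is the "inverted dropout" variant; the activation values and drop probability below are invented):

```python
# Inverted dropout during a training pass: zero a fraction p of activations
# at random, and rescale the survivors by 1/(1-p) so the expected value of
# each activation is unchanged.
import random

def dropout(activations, p, seed=0):
    """Zero each activation with probability p; rescale the rest."""
    rng = random.Random(seed)  # fixed seed only for reproducibility here
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

acts = [0.8, 0.3, 0.5, 0.9]
dropped = dropout(acts, p=0.5)
# each entry is either 0.0 (neuron's effect removed this run) or 2x original
```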
Courtesy of the Universal Approximation Theorems, it is known that FF-NN are powerful in theory;
however, converging on the promised network weights and biases in practice by procedures like
backpropagation is difficult for several reasons:
Large layers and full connections between them require both lots of training data and very large
(often exponential) numbers of training runs to converge, cf., so-called "few-shot" learning of
natural language by human beings.
When layer sizes are reduced by instead using many smaller layers (i.e., when networks become
deeper), the values of the loss-function gradients exploited by backpropagation may explode or
vanish, leading either to wild swings in weight and bias values across training runs or to no
changes at all in the weight and bias values of layers further from the output layer.
These problems have been dealt with in more advanced types of neural networks by adding additional
structures and modules in the network architecture such that embedded FF-NN are smaller and flatter.
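The explode/vanish behaviour is easy to see numerically: a backpropagated gradient is a product of per-layer terms, so it shrinks or grows geometrically with depth (the per-layer factors below are invented but representative -- the sigmoid's derivative is at most 0.25):

```python
# Gradient magnitude after 20 layers, as a product of per-layer factors.
vanish, explode = 1.0, 1.0
for _ in range(20):
    vanish *= 0.25   # per-layer factor < 1 (e.g., sigmoid derivative) -> vanishing
    explode *= 1.5   # per-layer factor > 1 -> exploding
# vanish is about 9.1e-13 (effectively zero); explode is about 3325
```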
RNN can be trained by a variant of backpropagation that moves backwards through time
rather than through layers from the output, where the initial hidden state consists of
random values. This is best visualized relative to the unrolled version. That being said, it
is important to note that the only FF-NN whose weights and biases are changed is the one
corresponding to timestep zero -- in all other cells, only errors and gradients are
propagated.
There are several types of RNN, each of which is used in particular applications (shown
below in their unrolled versions). These types vary in the relationship between the lengths
of their input and output sequences.
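A single timestep of a simple (Elman-style) RNN cell, and the "unrolling" of the cell over a sequence, can be sketched as follows (all weight matrices and the input sequence below are invented; the initial hidden state is zeroed here rather than random, purely for reproducibility):

```python
# One timestep of a simple RNN cell: the new hidden state mixes the current
# input with the previous hidden state: h_t = tanh(Wx.x + Wh.h_prev + b).
import math

def rnn_step(x, h_prev, Wx, Wh, b):
    return [math.tanh(
                sum(wx * xi for wx, xi in zip(Wx[j], x)) +
                sum(wh * hi for wh, hi in zip(Wh[j], h_prev)) +
                b[j])
            for j in range(len(b))]

# "Unrolling" = applying rnn_step once per input token, threading h through.
Wx, Wh, b = [[0.6], [-0.4]], [[0.1, 0.2], [0.3, -0.1]], [0.0, 0.1]
h = [0.0, 0.0]                      # initial hidden state
for x in [[1.0], [0.5], [-1.0]]:    # a length-3 sequence of 1-d inputs
    h = rnn_step(x, h, Wx, Wh, b)
```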
Fixed-length context vectors are problematic if target sequences that must be produced by
the decoder are longer than the source sequences used to create context vectors in the
encoder. This has been mitigated by incorporating an attention mechanism into the encoder /
decoder architecture.
In this mechanism, all (not just the final) hidden states produced by the encoder are
available to every cell in the decoder.
This set of hidden states is then weighted in a manner specific to each decoder cell to
indicate the relevance of each encoder-input token to that cell's output token.
Essentially, this enables decoder cells to have individually-tailored variable-length
context vectors.
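The weighting step can be sketched with simple dot-product attention (one of several scoring schemes; the hidden-state vectors below are invented): a decoder state scores every encoder hidden state, the scores are softmaxed into weights, and the weighted sum becomes that decoder cell's own context vector.

```python
# Dot-product attention: per-decoder-cell weighting of encoder hidden states.
import math

def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    scores = [sum(d * e for d, e in zip(decoder_state, enc))  # dot-product score
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

enc = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]     # all encoder hidden states
weights, context = attend([1.0, 0.0], enc)     # one decoder cell's view
# the first encoder state aligns best with this decoder state
```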
Transformers (Kedia and Rasu (2020), Chapter 11)
The Transformer architecture, first proposed in 2017, retains the notion of encoders and
decoders but discards the many-many RNN backbone. The encoders and decoders are
placed in an encoder stack and a decoder stack, respectively.
Starting in 2018, Transformers have become the state-of-the-art NLP technique courtesy of
two frameworks, BERT and GPT, both of which pre-train generic Transformer-based
systems for a specific language that can subsequently be fine-tuned for particular
applications.
BERT (Bidirectional Encoder Representation from Transformers)
Created at Google in 2018.
Indeed, the above suggests that rule-based NLP systems may be much more applicable than is
sometimes thought, and that a fusion of NN- and rule-based NLP system components (or a
fundamental re-thinking of NLP system architecture that incorporates the best features of both)
may be necessary to create the practical real-world NLP systems of the future.
Applications: Partial Morphological and Syntactic Parsing
As we have seen in previous lectures, complete morphological parsing of words can be done using
finite-state mechanisms and complete syntactic parsing of utterances can be done using context-free
mechanisms. However, there are a number of applications in which partial ("shallow") morphological
and syntactic parsing are desirable.
Stemmers and Lemmatizers (Sub-word Morphological Parsing) (BKL, Section 3.6; J&M, Section 3.8)
Given a word, a stemmer removes known morphological affixes to give a basic word stem, e.g.,
foxes, deconstructivism, uncool => fox, construct, cool; given a word which may have multiple
forms, a lemmatizer gives the canonical / basic form of that word, e.g., sing, sang, sung => sing.
Stemmers and lemmatizers are used to "normalize" texts as a prelude to selecting content words
for a text or selecting general word-forms for searches, e.g., process, processes, processing =>
process, in information retrieval.
There are a variety of stemmers implementing different degrees of affix removal and hence
different levels of generalization. It is important to choose the right stemmer for an application,
e.g., if "stocks" and "stockings" are both stemmed to "stock", financial and footwear queries will
return the same answers.
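This trade-off can be made concrete with a toy suffix-stripping stemmer (far cruder than real stemmers such as the Porter stemmer; the suffix list is invented for illustration):

```python
# A toy suffix-stripping stemmer; note how aggressive stripping conflates
# "stocks" and "stockings", exactly the problem described in the text.
SUFFIXES = ["ings", "ing", "es", "s"]   # tried longest-first

def toy_stem(word):
    for suf in SUFFIXES:
        # require a stem of at least 3 letters to avoid mangling short words
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(toy_stem("processes"))   # process
print(toy_stem("stocks"))      # stock
print(toy_stem("stockings"))   # stock  <- conflated with "stocks"
```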
Part-of-Speech (POS) Tagging (Single-level Syntactic Parsing) (BKL, Chapter 5; J&M, Chapter 5)
Given a word in an utterance, a POS tagger returns a guess for the POS of that word relative to a
fixed set of POS tags.
The outputs of POS taggers are used as inputs to other partial parsing algorithms (see below),
automated speech generation (to help determine pronunciation, e.g., "use" (noun) vs. "use"
(verb)), and various information retrieval algorithms (see below) (to help locate important noun or
verb content words).
Existing POS algorithms have differing performances, knowledge-base requirements, and running
times and space requirements. Hence, no POS tagger is good in all applications.
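The simplest end of this performance spectrum is a most-frequent-tag baseline: tag each word with whichever tag it bears most often in a tagged training corpus, defaulting to NN for unseen words (the tiny corpus below is invented; real taggers train on corpora like the Penn Treebank):

```python
# Most-frequent-tag baseline POS tagger.
from collections import Counter, defaultdict

tagged_corpus = [("the", "DT"), ("flight", "NN"), ("use", "NN"),
                 ("use", "VB"), ("use", "VB"), ("book", "VB")]

counts = defaultdict(Counter)
for word, tag_ in tagged_corpus:
    counts[word][tag_] += 1

def tag(word):
    # guard with "in" so the defaultdict is not polluted by lookups
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

print([(w, tag(w)) for w in ["book", "the", "flight"]])
# [('book', 'VB'), ('the', 'DT'), ('flight', 'NN')]
```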
Chunking (Few-level Syntactic Parsing) (BKL, Chapter 7; J&M, Section 13.5)
Given an utterance, a chunker returns a non-overlapping (but not necessarily total in terms of
word coverage) breakdown of that utterance into a sequence of one or more basic phrases or
chunks, e.g., "book the flight through Houston" => [NP(the flight), NP(Houston)].
A chunk typically corresponds to a very low-level syntactic phrase such as a non-recursive NP,
VP, or PP.
Chunkers are used to isolate entities and relationships in texts as a pre-processing step for
information retrieval.
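A minimal NP chunker over POS-tagged input can be sketched as follows (the grammar -- an optional determiner followed by nouns -- is deliberately simplistic; real chunkers such as NLTK's RegexpParser support much richer chunk grammars):

```python
# Minimal NP chunking: group an optional determiner plus following nouns
# into an NP chunk; everything else is left unchunked (partial coverage).
def np_chunk(tagged):
    chunks, buf, has_noun = [], [], False
    for word, tag in tagged + [("<end>", "<END>")]:   # sentinel flushes buffer
        if tag == "DT" and not buf:
            buf.append(word)
        elif tag in ("NN", "NNP"):
            buf.append(word)
            has_noun = True
        else:
            if has_noun:                              # a bare "the" is not an NP
                chunks.append("NP(" + " ".join(buf) + ")")
            buf, has_noun = [], False
    return chunks

sent = [("book", "VB"), ("the", "DT"), ("flight", "NN"),
        ("through", "IN"), ("Houston", "NNP")]
print(np_chunk(sent))   # ['NP(the flight)', 'NP(Houston)']
```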
Week 3
Monday, February 5 (Lecture #6) (FS)
Applications: Language Understanding (Utterance) (BKL, Section 10; J&M, Section 17)
When considering an utterance in isolation from its surrounding discourse, the goal is to infer a semantic
representation of the utterance giving the literal meaning of the utterance.
Note that this is not always appropriate, e.g., "I think it's lovely that your sister and her five kids
will be staying with us for a month", "Would you like to stand in the corner?".
A commonly-used representation that satisfies most of the above requirements for many simple
applications is First-Order Logic (FOL) (propositional logic augmented with predicates and
quantifiers).
Even in the absence of lexical and grammatical ambiguity, ambiguity in meaning can arise
from ambiguous quantifier scoping, e.g., the two possible interpretations above of "every
restaurant has a menu" (J&M, Section 18.3).
The meanings of certain phrases and utterances in natural language are not compositional,
e.g., "roll of the dice", "grand slam", "I could eat a horse" (J&M, Section 18.6)
Many of these shortcomings can be mitigated using more complex representations and
representation-manipulation mechanisms; however, given the additional processing costs
associated with these more complex schemes, the best ways of encoding and manipulating the
semantics of utterances is still a very active area of research.
There are two types of multi-utterance discourses: monologues (a narrative on some topic presented by a
single speaker) and dialogues (a narrative on some topic involving two or more speakers). As the NLP
techniques used to handle these types of discourse are very different, we shall treat them separately.
Applications: Language Understanding (Discourse / Monologue) (J&M, Chapter 21)
The individual utterances in a monologue are woven together by two types of mechanisms:
Coherence: This encompasses various types of relations between utterances showing how
they contribute to a common topic, e.g. (J&M, Examples 21.4 and 21.5),
as well as use of sentence-forms that establish a common entity-focus, e.g. (J&M, Examples
21.6 and 21.7),
John went to his favorite musical store to buy a piano. He had frequented the store for
many years. He was excited that he could finally buy a piano. He arrived just as the
store was closing for the day.
John went to his favorite music store to buy a piano. It was a store that John had
frequented for many years. He was excited that he could finally buy a piano. It was
closing just as John arrived.
Ideally, reconstructing the full meaning of a monologue requires correct recognition of all
coherence and coreference relations in that monologue (creating a monologue parse graph, if you
will). This is exceptionally difficult to do as the low-level indicators of coherence and coreference
(for example, cue phrases (e.g., "because" "although") and pronouns, respectively) have multiple
possible uses and are thus ambiguous, and resolving this ambiguity accurately can be either very
computationally costly or even impossible.
Computational implementations of coherence and coreference recognition are often forced to rely
on heuristics, e.g., using recency of mention and basic number / gender agreement to resolve
pronouns (Hobbs' Algorithm: J&M, Section 21.6.1).
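A toy sketch of such a heuristic -- resolve each pronoun to the most recently mentioned entity whose gender and number agree with it (the entity features are hand-annotated here; real systems must infer them, and full Hobbs-style resolution also walks syntactic parse trees):

```python
# Recency + agreement pronoun resolution (toy version).
PRONOUN_FEATURES = {"he": ("male", "sg"), "she": ("female", "sg"),
                    "it": ("neuter", "sg"), "they": (None, "pl")}

def resolve(pronoun, mentions):
    """mentions: list of (name, gender, number) tuples, oldest first."""
    gender, number = PRONOUN_FEATURES[pronoun]
    for name, g, n in reversed(mentions):        # most recent first
        if n == number and (gender is None or g == gender):
            return name
    return None

# Entities from the piano monologue above, in order of mention:
mentions = [("John", "male", "sg"), ("the store", "neuter", "sg"),
            ("a piano", "neuter", "sg")]
print(resolve("he", mentions))   # John
print(resolve("it", mentions))   # a piano -- but "It was closing" means the
                                 # store, showing how recency alone can fail
```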
Alternatively, one can weaken one's notion of what the meaning of a monologue is:
There are a very large number of narrow NLP techniques used to access these types of monologue
meaning; they are the subject of study in the subdisciplines of Information Retrieval
(Tzoukermann et al (2003); J&M, Section 23.1) and Information Extraction (BKL, Chapter 7;
Grishman (2003); J&M, Chapter 22) respectively.
Applications: Language Understanding (Discourse / Dialogue) (J&M, Chapters 23 and 24)
Even if one participant plays a more prominent role in terms of the amount of speech or guiding the
ongoing narrative, every dialogue is at heart a joint activity among the participants.
Characteristics of dialogue (J&M, Section 24.1):
Implemented SDS deal with this by restricting their functionality and domain of interaction and
retaining the primary initiative in the dialogue while allowing limited initiative on the part of
human users, e.g., travel management and planning SDS (J&M, pp. 811-813).
The simplest single-initiative DM have finite-state implementations, e.g.,
In addition to simplifying the DM, the restrictions above can also dramatically simplify the
required language understanding and generation abilities to handle only expected
interaction-types (J&M, p. 822).
Alternatively, SDS can focus on maintaining the illusion of dialogue while having the minimum
(or even no) underlying mechanisms for modeling common ground or dialogue narrative:
Question Answering (QA) Systems (Harabagiu and Moldovan (2003); J&M, Chapter 23)
Such systems answer questions from users which require single-word or phrase
answers (factoids).
Given the limited form of most questions, the topic of a question can typically be
extracted from the question-utterance by very simple parsing (sometimes even by
pattern matching).
Factoids can then be extracted from relevant texts, where relevance is assessed using
techniques from Information Retrieval and Information Extraction.
Chatbots
Such systems engage in conversation with human beings, sometimes with some
explicit purpose in mind (e.g., telephone call routing (Lloyds Banking Group, Royal
Bank of Scotland, Renault Citroen)) but more often just as entertainment.
The first chatbots were ELIZA (Weizenbaum (1966): a simulation of a Rogerian
psychotherapist) and PARRY (Colby (1981): a simulation of a paranoid personality).
ELIZA maintains no conversational state, instead relying on strategic pattern-
matching of key phrases in user utterances and substitution of these phrases in
randomly-selected utterance-frames. PARRY maintains a minimal internal state
corresponding to the degree of the system's own "emotional agitation", which is
then used to modify selected user utterance-phrases and their response-frames.
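The ELIZA strategy -- match a key phrase, swap pronouns, and drop the captured text into a canned response frame, with no conversational state -- can be sketched as follows (the patterns, frames, and pronoun swaps below are invented; the real ELIZA used a richer keyword-and-rank mechanism):

```python
# ELIZA-style stateless pattern-matching chatbot (toy version).
import re

RULES = [
    (re.compile(r"i am (.*)", re.I), "Why do you say you are {}?"),
    (re.compile(r"i feel (.*)", re.I), "How long have you felt {}?"),
]
SWAPS = {"my": "your", "me": "you", "i": "you"}   # first-/second-person swap

def respond(utterance):
    for pattern, frame in RULES:
        m = pattern.match(utterance)
        if m:
            captured = " ".join(SWAPS.get(w, w)
                                for w in m.group(1).lower().split())
            return frame.format(captured)
    return "Please go on."   # content-free default when nothing matches

print(respond("I am worried about my exam"))
# Why do you say you are worried about your exam?
```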
There are many chatbots today, e.g.,
Eliza
Talk Bot
(see also The Personality Forge, The Chatterbot Collection, and character.ai). Though
many older chatbots rely on modifications of the mechanisms pioneered in ELIZA
and PARRY, modern chatbots are often built using Seq2Seq modeling as implemented
by neural network models (Vajjala et al (2020), Chapter 6).
Regardless of the simplicity of the mechanisms involved, chatbots can (with the co-
operation of the human beings that they interact with) give a startling illusion of
sentience (Epstein (2007); Tidy (2024); Weizenbaum (1967, 1979)). This is due in
large part to innate human abilities to extract order from the environment, e.g.,
hearing voices or seeing faces in random audio and visual noise, which (in the context
of interaction) assume sentience and agency where none is present.
Current dialogue systems can give impressive illusions of understanding (especially recently-
developed conversational AIs based on large language models such as ChatGPT and Bing's AI
chatbot (Kelly (2023))) and do indeed have sentient ghosts within them -- however, for now, these
ghosts come from us (Bender and Shah (2022); Greengard (2023); Videla (2023)). Hence, these
systems are for all intents and purposes "glorified version[s] of the autocomplete feature on your
smartphone" (Mike Wooldridge, quoted in Gorvett (2023)), conflating the most probable with the
correct, and answers that are given to questions (in light of known problems of these systems wrt
factual consistency and citing appropriate sources for statements made) should be treated with
extreme caution (Du et al (2024); Dutta and Chakraborty (2023)).
Applications: Machine Translation (MT) (Hutchins (2003); J&M, Chapter 25; Somers (2003))
MT was among the first applications studied in NLP (J&M, p. 905). MT (and indeed, translation
in general) is difficult because of various deep morphosyntactic and lexical differences among
human languages, e.g., wall (English) vs. Wand / Mauer (German) (J&M, Section 25.1).
Three classical approaches to MT (J&M, Section 25.2):
Direct: A first pass on the given utterance directly translates words with the aid of a
bilingual dictionary, followed by a second pass in which various simple word re-ordering
rules are applied to create the translation.
Transfer: A syntactic parse of the given utterance is modified to create a partial parse for
the translation, which is then (with the aid of a bilingual word and phrase dictionary) further
modified to create the translation. More accurate translation is possible if the parse of the
given utterance is augmented with basic semantic information.
Interlingua: The given utterance undergoes full morphosyntactic and semantic analysis to
create a purely semantic form, which is then processed in reverse relative to the target
language to create the translation.
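The Direct approach is simple enough to sketch end-to-end (the two-pass structure follows the description above; the bilingual dictionary, word categories, and reordering rule are a tiny invented French-like example that ignores gender agreement):

```python
# Direct MT (toy): pass 1 translates word-for-word via a bilingual
# dictionary; pass 2 applies a simple reordering rule.
BILINGUAL = {"the": "le", "red": "rouge", "car": "voiture"}   # invented entries
ADJECTIVES = {"red"}
NOUNS = {"car"}

def direct_translate(words):
    # Pass 1: dictionary lookup (unknown words pass through untranslated).
    target = [BILINGUAL.get(w, w) for w in words]
    # Pass 2: an adjective-noun pair in the source becomes noun-adjective
    # in the target (English -> Romance-style ordering).
    for i in range(len(words) - 1):
        if words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            target[i], target[i + 1] = target[i + 1], target[i]
    return target

print(direct_translate(["the", "red", "car"]))   # ['le', 'voiture', 'rouge']
```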
The relations between these three approaches are often summarized in the Vauquois Triangle:
The Direct and Transfer approaches require detailed (albeit shallow) processing mechanisms for
each pair of source and target languages, and are hence best suited for one-one or one-many
translation applications, e.g., maintaining Canadian government documents in both English and
French, translating English technical manuals for world distribution. The Interlingua approach is
best suited for many-many translation applications, e.g., inter-translating documents among all 25
member states of the European Union.
In part because of the lack of progress, the Automated Language Processing Advisory Committee
(ALPAC) report recommended termination of MT funding in the 1960's. Research resumed in the
late 1970's, re-invigorated in large part because of advances in semantics processing in AI as well
as probabilistic techniques borrowed from ASR (J&M, p. 905).
Statistical MT uses inference over probabilistic translation and language models (cf. the acoustic
and language models used in ASR).
The translation models assess the probability of a given utterance being paired with a
particular translation using the concept of word/phrase alignment.
Example #1: An alignment of an English utterance and its French translation:
To create the necessary probabilities (which are typically encoded in HMMs), one needs large
databases of valid (often manually created) alignments.
Fully Automatic High-Quality Translation (FAHQT) is the ultimate goal. This is currently
achievable for translations relative to restricted domains, e.g., weather forecasts, basic
conversational phrases. Current MT (which is largely based on the statistical and, most recently,
neural network (Vajjala et al (2020), pp. 265-268) approaches) is acceptable in larger domains if
rough translations are acceptable, e.g., as quick-and-dirty first drafts for subsequent human editing
(Computer-Aided Translation (CAT)). (J&M, pp. 861-862; see also BBC News (2014)) or
simultaneous speech translation. It seems likely that further improvements will require a fusion of
classical (in particular Interlingua) and statistical approaches.
Wednesday, February 7
In-class Exam
References
BBC News (2014) "Translation tech helps firms talk business round the world." (Retrieved November 14, 2014)
Baker, J.M., Deng, L., Khudanpur, S., Lee, C.-H., Glass, J., Morgan, N., and O'Shaughnessy, D. (2009b)
"Research Developments and Directions in Speech Recognition and Understanding, Part 2." IEEE Signal
Processing Magazine, 26(6), 78-85.
Beesley, K.R. and Karttunen, L. (2000) "Finite-state Non-concatenative Morphotactics." In SIGPHON
2000. 1-12. [PDF]
Beesley, K.R. and Karttunen, L. (2003) Finite-State Morphology. CSLI Publications.
Bender, E. and Shah, C. (2022) "All-knowing machines are a fantasy." IAI News. (Text)
Bird, S. (2003) "Phonology." In R. Mitkov (ed.) (2003), pp. 3-24.
Bird, S., Klein, E., and Loper, E. (2009) Natural Language Processing with Python. O'Reilly Media.
[Abbreviated above as BKL]
Carpenter, B. (2003) "Complexity." In R. Mitkov (ed.) (2003), pp. 178-200.
Colby, K.M. (1981) "Modeling a paranoid mind." Behavioral and Brain Sciences, 4(4), 515-534.
Dutoit, T. and Stylianou, Y. (2003) "Text-to-Speech Synthesis." In R. Mitkov (ed.) (2003), pp. 323-338.
Du, M., He, F., Zou, N., Tao, D., and Hu, X. (2024) "Shortcut Learning of Large Language Models in Natural
Language Understanding." Communications of the ACM, 67(1), 110-19. [Text]
Dutta, S. and Chakraborty, T. (2023) "Thus Spake ChatGPT: On the Reliability of AI-based Chatbots for
Science Communication." Communications of the ACM, 66(12), 16-19. [Text]
Edwards, C. (2021) "The best of NLP." Communications of the ACM, 64(4), 9-11. [Text]
Epstein, R. (2007) "From Russia With Love: How I got fooled (and somewhat humiliated) by a computer."
Scientific American Mind, October, 6-17.
Gorvett, Z. (2023) "The AI emotions dreamed up by ChatGPT." BBC Future. Accessed February 28, 2023.
[Text]
Greengard, S. (2023) "Computational Linguistics Finds Its Voice." Communications of the ACM, 66(2), 18-20.
[Text]
Grishman, R. (2003) "Information Extraction." In R. Mitkov (ed.) (2003), pp. 545-559.
Harabagiu, S. and Moldovan, D. (2003) "Question Answering." In R. Mitkov (ed.) (2003), pp. 560-582.
Hutchins, J. (2003) "Machine Translation: General Overview." In R. Mitkov (ed.) (2003), pp. 501-511.
Jurafsky, D. and Martin, J.H. (2008) Speech and Language Processing (2nd Edition). Prentice-Hall.
[Abbreviated above as J&M]
Jurafsky, D. and Martin, J.H. (2022) Speech and Language Processing (3rd Edition). (Book
Website) [Abbreviated above as J&M2]
Kaplan, R. (2003) "Syntax." In R. Mitkov (ed.) (2003), pp. 70-90.
Kay, M. (2003) "Introduction." In R. Mitkov (ed.) (2003), pp. xvii-xx.
Kedia, A, and Rasu, M. (2020) Hands-On Python Natural Language Processing. Packt Publishing;
Birmingham, UK.
Kelly, S.M. (2023) "The dark side of Bing's new AI chatbot." CNN Business. Accessed February 17, 2023.
[Text]
Kenstowicz, M.J. (1994) Phonology in Generative Grammar. Basil Blackwell.
Kiraz, G.A. (2000) "Multi-tiered nonlinear morphology using multitape finite automata: A case study on
Syriac and Arabic." Computational Linguistics, 26(1), 77-105. [PDF]
Lamel, L. and Gauvain, J.-L. (2003) "Speech Recognition." In R. Mitkov (ed.) (2003), pp. 305-322.
Lappin, S. (2003) "Semantics." In R. Mitkov (ed.) (2003), pp. 91-111.
Leech, G. and Weisser, M. (2003) "Pragmatics and Dialogue." In R. Mitkov (ed.) (2003), pp. 136-156.
Li, H. (2022) "Language models: past, present, and future." Communications of the ACM, 65(7), 56-63.
(HTML)
Lovins, J.B. (1973) Loanwords and the Phonological Structure of Japanese. PhD thesis, University of
Chicago.
Marcus, G. and Davis, E. (2019) Rebooting AI: Building Artificial Intelligence We Can Trust. Pantheon
Books; New York.
Marcus, G. and Davis, E. (2021) "Insights for AI from the Human Mind", Communications of the ACM, 64(1),
38-41. [Text]
Martin-Vide, C. (2003) "Formal Grammars and Languages." In R. Mitkov (ed.) (2003), pp. 157-177.
Mitkov, R. (ed.) (2003) The Oxford Handbook of Computational Linguistics. Oxford University Press.
Mitkov, R. (2003a) "Anaphora Resolution." In R. Mitkov (ed.) (2003), pp. 266-283.
Mohri, M. (1997) "Finite-state transducers in language and speech processing." Computational Linguistics,
23(2), 269-311. [PDF]
Nederhof, M.-J. (1996) "Introduction to Finite-State Techniques." Lecture notes. [PDF]
Ramsay, A. (2003) "Discourse." In R. Mitkov (ed.) (2003), pp. 112-135.
Reis, E. S. D., Costa, C. A. D., Silveira, D. E. D., Bavaresco, R. S., Righi, R. D. R., Barbosa, J. L. V., and
Federizzi, G. (2021) "Transformers aftermath: current research and rising trends." Communications of the
ACM, 64(4), 154-163. [Text]
Roche, E. and Schabes, Y. (eds.) (1997) Finite-state Natural Language Processing. The MIT Press.
Roche, E. and Schabes, Y. (1997a) "Introduction." In E. Roche and Y. Schabes (Eds.) (1997), pp. 1-66.
Schmidhuber, J. (2015) "Deep learning in neural networks: An overview." Neural Networks, 61, 85-117.
Somers, H. (2003) "Machine Translation: Latest Developments." In R. Mitkov (ed.) (2003), pp. 512-528.
Sproat, R. (1992) Morphology and Computation. The MIT Press.
Tidy, J. (2024) "Character.ai: Young people turning to AI therapist bots." BBC Technology. Accessed January
8, 2024. [Text]
Trost, H. (2003) "Morphology." In R. Mitkov (ed.) (2003), pp. 25-47.
Tzoukermann, E., Klavans, J.L., and Strzalkowski, T. (2003) "Information Retrieval." In R. Mitkov (ed.)
(2003), pp. 529-544.
Vajjala, S., Majumder, B., Gupta, A., and Surana, H. (2020) Practical Natural Language Processing: A
Comprehensive Guide to Building Real-world NLP Systems. O'Reilly; Boston, MA.
Videla, A. (2023) "Echoes of Intelligence: Textual Interpretation and Large Language Models."
Communications of the ACM, 66(11), 38-43. [Text]
Weizenbaum, J. (1966) "ELIZA - A computer program for the study of natural language communication
between man and machine." Communications of the ACM, 9(1), 36-45.
Weizenbaum, J. (1967) "Contextual understanding by computers." Communications of the ACM, 10(8), 474-
480.
Weizenbaum, J. (1979) Computer power and human reason. W.H. Freeman.
(end of diary)