
Dissociating Language and Thought in Large Language Models: A Cognitive Perspective

A Preprint

Kyle Mahowald* (The University of Texas at Austin, mahowald@utexas.edu)
Anna A. Ivanova* (Massachusetts Institute of Technology, annaiv@mit.edu)
Idan A. Blank (University of California, Los Angeles, iblank@psych.ucla.edu)
Nancy Kanwisher (Massachusetts Institute of Technology, ngk@mit.edu)
Joshua B. Tenenbaum (Massachusetts Institute of Technology, jbt@mit.edu)
Evelina Fedorenko (Massachusetts Institute of Technology, evelina9@mit.edu)

arXiv:2301.06627v1 [cs.CL] 16 Jan 2023

January 18, 2023

Abstract
Short abstract (100 words): Large language models (LLMs) have come closest among all models
to date to mastering human language, yet opinions about their capabilities remain split. Here,
we evaluate LLMs using a distinction between formal competence—knowledge of linguistic rules
and patterns—and functional competence—understanding and using language in the world. We
ground this distinction in human neuroscience, showing that these skills recruit different cognitive
mechanisms. Although LLMs are close to mastering formal competence, they still fail at functional
competence tasks, which often require drawing on non-linguistic capacities. In short, LLMs are good
models of language but incomplete models of human thought.
Long abstract (250 words): Today’s large language models (LLMs) routinely generate coherent,
grammatical and seemingly meaningful paragraphs of text. This achievement has led to speculation
that these networks are—or will soon become—“thinking machines”, capable of performing tasks that
require abstract knowledge and reasoning. Here, we review the capabilities of LLMs by considering
their performance on two different aspects of language use: ‘formal linguistic competence’, which
includes knowledge of rules and patterns of a given language, and ‘functional linguistic competence’,
a host of cognitive abilities required for language understanding and use in the real world. Drawing
on evidence from cognitive neuroscience, we show that formal competence in humans relies on
specialized language processing mechanisms, whereas functional competence recruits multiple
extralinguistic capacities that comprise human thought, such as formal reasoning, world knowledge,
situation modeling, and social cognition. In line with this distinction, LLMs show impressive
(although imperfect) performance on tasks requiring formal linguistic competence, but fail on many
tests requiring functional competence. Based on this evidence, we argue that (1) contemporary
LLMs should be taken seriously as models of formal linguistic skills; (2) models that master real-life
language use would need to incorporate or develop not only a core language module, but also multiple
non-language-specific cognitive capacities required for modeling thought. Overall, a distinction
between formal and functional linguistic competence helps clarify the discourse surrounding LLMs’
potential and provides a path toward building models that understand and use language in human-like
ways.

* The two lead authors contributed equally to this work.



Contents

1 Introduction
2 Formal vs. functional linguistic competence
  2.1 What does linguistic competence entail?
    2.1.1 Formal linguistic competence
    2.1.2 Functional linguistic competence
  2.2 Motivation for the distinction between formal vs. functional linguistic competence
    2.2.1 The language network in the human brain
    2.2.2 The language network does not support non-linguistic cognition
3 The success of large language models in acquiring formal linguistic competence
  3.1 Statistical language models: some fundamentals
  3.2 What large language models can do: a case study
  3.3 Large language models learn core aspects of human language processing
    3.3.1 LLMs learn hierarchical structure
    3.3.2 LLMs learn abstractions
  3.4 LLMs resemble the human language-selective network
  3.5 Limitations of LLMs as human-like language learners and processors
    3.5.1 Excessive reliance on statistical regularities
    3.5.2 Unrealistic amounts of training data
    3.5.3 Insufficient tests on languages other than English
  3.6 Interim Conclusions
4 The failure of large language models in acquiring functional linguistic competence
  4.1 LLMs are great at pretending to think
  4.2 How LLMs fail
  4.3 Limitations of LLMs as real-life language users
    4.3.1 Formal reasoning
    4.3.2 World knowledge and commonsense reasoning
    4.3.3 Situation modeling
    4.3.4 Social reasoning (pragmatics and intent)
  4.4 Interim conclusions
5 Building models that talk and think like humans
  5.1 Modularity
  5.2 Curated data and diverse objective functions
  5.3 Separate benchmarks for formal and functional competence
6 General Conclusion


1 Introduction
When we hear a sentence, we typically assume that it was produced by a rational, thinking agent (another person). The
sentences that people generate in day-to-day conversations are based on their world knowledge (“Not all birds can fly.”),
their reasoning abilities (“You’re 15, you can’t go to a bar.”), and their goals (“Would you give me a ride, please?”).
Naturally, we often use other people’s statements not only as a reflection of their linguistic skill, but also as a window
into their mind, including how they think and reason.
In 1950, Alan Turing leveraged this tight relationship between language and thought to propose his famous test [Turing,
1950]. The Turing test uses language as an interface between two agents, allowing human participants to probe the
knowledge and reasoning capacities of two other agents to determine which of them is a human and which is a machine.1
Although the utility of the Turing test has since been questioned, it has undoubtedly shaped the way society today thinks
of machine intelligence [French, 1990, 2000, Boneh et al., 2019, Pinar Saygin et al., 2000, Moor, 1976, Marcus et al.,
2016].
The popularity of the Turing test, combined with the fact that language can, and typically does, reflect underlying
thoughts has led to several common fallacies related to the language-thought relationship. We focus on two of these.
The first fallacy is that an entity (be it a human or a machine) that is good at language must also be good at thinking. If
an entity generates long coherent stretches of text, it must possess rich knowledge and reasoning capacities. Let’s call
this the “good at language -> good at thought” fallacy.
The rise of large language models [LLMs; Vaswani et al., 2017a, Devlin et al., 2019, Bommasani et al., 2021], most
notably OpenAI’s GPT-3 [Brown et al., 2020], has brought this fallacy to the forefront. Some of these models can
produce text that is difficult to distinguish from human output, and even outperform humans at some text comprehension
tasks [Wang et al., 2018, 2019a, Srivastava et al., 2022]. As a result, claims have emerged—both in the popular press
and in the academic literature—that LLMs represent not only a major advance in language processing but, more broadly,
in Artificial General Intelligence (AGI), i.e., a step towards a “thinking machine” (see e.g., Dale 2021 for a summary of
alarmist newspaper headlines about GPT-3). Some, like philosopher of mind David Chalmers [Chalmers, 2022], have
even taken seriously the idea that these models have become sentient [although Chalmers stops short of arguing that
they are sentient; see also Cerullo, 2022]. However, as we show below, LLMs’ ability to think is more questionable.
The “good at language -> good at thought” fallacy is unsurprising given the propensity of humans to draw inferences
based on their past experiences. It is still novel, and thus uncanny, to encounter an entity (e.g., a model) that generates
fluent sentences despite lacking a human identity. Thus, our heuristics for understanding what the language model is
doing—heuristics that emerged from our language experience with other humans—are broken.2
The second fallacy is that a model that is bad at thinking must also be a bad model of language. Let’s call this the “bad
at thought -> bad at language” fallacy. LLMs are commonly criticized for their lack of consistent, generalizable world
knowledge [e.g. Elazar et al., 2021a], lack of commonsense reasoning abilities [e.g., the ability to predict the effects of
gravity Marcus, 2020], and failure to understand what an utterance is really about [e.g., Bender and Koller, 2020a, Bisk
et al., 2020]. While these efforts to probe model limitations are useful in identifying things that LLMs can’t do, some
critics suggest that the models’ failure to produce linguistic output that fully captures the richness and sophistication of
human thought means that they are not good models of human language.
Chomsky said in a 2019 interview (Lex Fridman, 2019): “We have to ask here a certain question: is [deep learning]
engineering or is it science? [. . . ] On engineering grounds, it’s kind of worth having, like a bulldozer. Does it tell
you anything about human language? Zero.” The view that deep learning models are not of scientific interest remains
common in linguistics and psycholinguistics, and, despite a number of position pieces arguing for integrating such
models into research on human language processing and acquisition [Baroni, 2021, Linzen, 2019, Linzen and Baroni,
2021, Pater, 2019, Warstadt and Bowman, 2022, Lappin, 2021], this integration still encounters resistance (e.g., from
Chomsky above).
Both the “good at language -> good at thought” and the “bad at thought -> bad at language” fallacies stem from
the conflation of language and thought, and both can be avoided if we distinguish between two kinds of linguistic
competence: formal linguistic competence (the knowledge of rules and statistical regularities of language) and functional
linguistic competence (the ability to use language in the real world, which often draws on non-linguistic capacities). Of
course, language does not live in a vacuum and is fundamentally embedded and social, so the formal capacity is of
1 In later versions of the test, the number of conversation partners has been reduced to one.
2 Note that people also commit a related fallacy, “bad at language -> bad at thought” (see Mahowald & Ivanova, 2022). Individuals who are not native speakers of a language, who do not speak hegemonic dialects, or whose speech is disfluent due to developmental or acquired speech and language disorders are often incorrectly perceived to be less smart and less educated [Kinzler, 2021, Kinzler et al., 2009, Hudley and Mallinson, 2015].


limited value without being integrated in a situated context [e.g., Clark, 1996, Hudley et al., 2020, Bucholtz and Hall,
2005, Labov, 1978, Wittgenstein, 1953, Grice, 1975, Lakoff, 1972, Clark, 1992]. But even solving the more restricted
problem of formal linguistic competence (e.g., what counts as a valid string of a language) is far from trivial and indeed
has been a major goal of modern linguistics.
Our motivation for the distinction between formal and functional linguistic competence comes from the human brain. A
wealth of evidence from cognitive science and neuroscience has established that language and thought in humans are
robustly dissociable: the machinery dedicated to processing language is separate from the machinery responsible for
memory, reasoning, and social skills [e.g., Fedorenko and Varley, 2016a; Section 2]. Armed with this distinction, we
evaluate contemporary LLM performance and argue that LLMs have promise as scientific models of one piece of the
human cognitive toolbox—formal language processing—but fall short of modeling human thought.
Ultimately, what “pure” LLMs can learn is necessarily constrained both by the information available in their training
data and by whether that information is learnable through a word prediction mechanism. It has turned out that quite a lot
of linguistic knowledge, e.g., about syntax and semantics, can be learned from language data alone [Potts, 2020, Merrill
et al., 2022, Bommasani et al., 2021], in our opinion far more than most researchers in the field would have guessed
5 or 10 years ago (see Merrill et al. [2022] for an argument for how semantic information is in principle learnable
from language data, and Piantadosi and Hill [2022] for an argument that models can genuinely learn meaning). The
success of these models is a major development, with far-reaching implications. But LLMs’ success in developing
linguistic knowledge by predicting words using massive amounts of text does not guarantee that all aspects of thought
and reasoning could be learned that way (although, as we will discuss, some aspects of thought and reasoning can be
learned that way provided the relevant information is typically encoded in distributional patterns over words).
By saying that LLMs do not, in and of themselves, model human thought, we are not suggesting that AI approaches
which start from building LLMs will necessarily run up against hard limits. Indeed, at the end of this article, we discuss
current modular approaches in which separate architectures or diverse objectives are combined. InstructGPT [Ouyang
et al., 2022] and ChatGPT are examples of successes in this vein, in that they combine an LLM with Reinforcement
Learning from Human Feedback (RLHF) [Christiano et al., 2017], whereby human feedback is used to iteratively adjust
the trained models. In that sense, they are more than just LLMs and can learn based on more than just what is available
in massive amounts of passively observed text. For our purposes here, we will use the term LLMs to refer primarily to
“pure” language models (such as the original GPT-3) that are trained to predict held-out language tokens conditional on
the immediate linguistic context, from large corpora of naturally observed language use.
In the rest of the paper, we formulate an account of what we should and should not expect from a model of language
and evaluate contemporary LLMs within this framework. In Section 2, we elaborate on the constructs of formal and
functional linguistic competence and motivate this distinction based on the evidence from human cognitive science and
neuroscience. In Section 3, we discuss the successes of LLMs in achieving formal linguistic competence, showing that
models trained on word filling-in/prediction tasks capture numerous complex linguistic phenomena. Then, in Section
4, we consider several domains required for functional linguistic competence—formal reasoning, world knowledge,
situation modeling, and social-cognitive abilities—on which today’s LLMs fail, or at least perform much worse than
humans. In Section 5, we discuss the implications of our framework for building and evaluating future models of
language and Artificial General Intelligence (AGI). Finally, in Section 6, we summarize our key conclusions.

2 Formal vs. functional linguistic competence

2.1 What does linguistic competence entail?

2.1.1 Formal linguistic competence

We define formal linguistic competence as a set of core, specific capacities required to produce and comprehend a
given language. Specifically, it involves the knowledge of and flexible use of linguistic rules [e.g., Chomsky, 1957,
Comrie, 1989, Pinker and Jackendoff, 2005], as well as of non-rule-like statistical regularities that govern that language
[e.g., Jackendoff and Pinker, 2005, Goldberg, 2019, Bybee and Hopper, 2001a]. Well-recognized aspects of formal
competence entail knowing a language’s vocabulary and how it can be productively composed to form grammatical
utterances. For example, most users of Standard Written English say, “The dogs in my bedroom are asleep” rather
than “The dogs in my bedroom is asleep”, because the verb “to be” must match the number of the noun that is the
subject of the sentence (“the dogs”), even though that verb is closer to an intervening, singular noun (“bedroom”).
Linguistic competence also requires exquisite sensitivity to the kinds of regularities that characterize idiosyncratic
linguistic constructions. For instance, although English speakers know not to use the indefinite article “a” with plural
nouns—making a phrase like “a days” ill-formed—they also know that it is allowed in a special construction where an

adjective and a numeral intervene: “a beautiful five days in New York” [Solt, 2007, Dalrymple and King, 2019, Keenan, 2013].

Figure 1: Successful use of language relies on multiple cognitive skills, some of which (required for formal competence) are language-specific and some (required for functional competence) are not. A failure to acquire a particular skill would result in a specific type of language use deficit. Determining whether a particular failure stems from a gap in formal competence or functional competence is key to evaluating and improving language models.
Human language users likely learn rules (or rule-like systems) along with thousands of idiosyncratic constructions
[Goldberg, 2019] through some combination of sophisticated statistical learning [Spelke, 2004, Aslin et al., 1998, Aslin,
2007, Bresnan, 2007, Bybee and Hopper, 2001b, Chater et al., 2006, Clark, 2014, Frank and Tenenbaum, 2011, Gerken,
2006, O’Donnell, 2011, Perfors et al., 2011, Saffran et al., 1996, Saffran and Thiessen, 2003] and innate conceptual,
and perhaps specifically linguistic, machinery [Berwick et al., 2011b, Chomsky, 1957, Gleitman, 1993, Jackendoff
and Jackendoff, 2002, Pinker, 2000, Pinker and Jackendoff, 2005, Pinker and Bloom, 1990]. The result is the human
ability to understand and produce language and to make judgments of the kind of utterances that are acceptable and
unacceptable in a language: “The customer ate.” but not “The customer devoured.”, “a beautiful five days in New York”
and not “a beautiful five day in New York”.

2.1.2 Functional linguistic competence

In addition to being competent in the rules and statistical regularities of language, a competent language user must be
able to use language to do things in the world [Bloom, 2002, Clark, 1996, Frank and Goodman, 2012, Grice, 1975,
1969, Slobin, 1996, Wilson and Sperber, 2002, Tomasello, 2010, Christiansen and Chater, 2016, Bucholtz and Hall,
2004]: to talk about things that can be seen or felt or heard, to reason about diverse topics, to make requests, to perform
speech acts, to cajole, prevaricate, and flatter. In other words, we use language to send and receive information from
other perceptual and cognitive systems, such as our senses and our memory, and we deploy words as part of a broader
communication framework supported by our sophisticated social skills. A formal language system in isolation is useless
to a language user unless it can interface with the rest of perception, cognition, and action.
This host of capacities to use language to do things in the world is distinct from formal competence and depends
crucially on aspects of non-linguistic cognition (Figure 1). Thus, we define functional linguistic competence as
non-language-specific cognitive functions that are required when we use language in real-world circumstances.


2.2 Motivation for the distinction between formal vs. functional linguistic competence

As noted in Section 1, our motivation for the distinction between formal and functional linguistic competence comes
from what we know about the functional architecture of the human mind. In humans, language is robustly dissociated
from the rest of high-level cognition, as well as from perception and action. Below we briefly summarize a body of
evidence from cognitive science and neuroscience that supports this dissociation.

2.2.1 The language network in the human brain

Human language processing draws on a set of interconnected brain areas in the frontal and temporal lobes (typically in
the left hemisphere). This ‘language network’ supports both comprehension (spoken, written, and signed; e.g., Deniz
et al., 2019, Fedorenko et al., 2010, MacSweeney et al., 2002, Regev et al., 2013, Scott et al., 2017) and production
[Menenti et al., 2011, Hu et al., 2022b]. Furthermore, the language network responds to stimulus features rather than
task demands, as evidenced by i) similar responses to linguistic input under passive listening/reading and task-driven
conditions [e.g., Diachek et al., 2020] and ii) similar patterns of fluctuations across participants when they process
naturalistic linguistic stimuli [e.g., Wilson et al., 2008, Lerner et al., 2011, Silbert et al., 2014, Blank and Fedorenko,
2017]. Further, the language network is sensitive to linguistic regularities at all levels: from phonological/sub-lexical,
to word level, to phrase/sentence level [Bautista and Wilson, 2016, Blank et al., 2016, Blank and Fedorenko, 2020,
Fedorenko et al., 2011, 2012, 2020, Regev et al., 2021] and supports linguistic operations related both to the
processing of word meanings and to combinatorial semantic and syntactic processing [Fedorenko et al.,
2020, Hu et al., 2022b]. This consistent recruitment for language across a broad range of conditions, as well as the fact
that damage to the language network leads to linguistic deficits [e.g., Bates et al., 2003, Broca, 1865, Damasio, 1992,
Mesulam, 2001, Mesulam et al., 2014, Saffran, 2000, Wernicke, 1874, Wilson et al., 2019], indicates that this set of
regions stores our linguistic knowledge representations—a set of mappings between linguistic forms and meanings.

2.2.2 The language network does not support non-linguistic cognition

The language network is remarkably selective for language alone. Evidence of a strong dissociation between language
processing and non-linguistic abilities comes from two main sources: a) behavioral investigations of individuals
with aphasia—a language impairment caused by damage to the language network, typically as a result of a stroke or
degeneration, and b) functional brain imaging studies of neurotypical adults.
Studies of individuals with aphasia provide a unique opportunity for testing which cognitive capacities rely on linguistic
representations. Of particular interest are cases of so-called ‘global aphasia’, which affects both production and
comprehension. Individuals with global aphasia exhibit severe linguistic deficits that, at best, spare nothing but single
word comprehension for a small set of words. If some aspects of non-linguistic cognition draw on the same resources as
language, then individuals with severe linguistic deficits should invariably exhibit impaired performance on the relevant
non-linguistic tasks.
Despite the nearly complete loss of linguistic abilities, some individuals with severe aphasia have intact non-linguistic
cognitive abilities: they can play chess, compose music, solve arithmetic problems and logic puzzles, leverage their
world knowledge to perform diverse tasks, reason about cause and effect, and navigate complex social situations [Basso
and Capitani, 1985, Bek et al., 2010, Klessinger et al., 2007, Luria et al., 1965, Lecours and Joanette, 1980, Varley,
1998, Varley and Siegal, 2000, Varley et al., 2001, 2005, Willems et al., 2011, see Fedorenko and Varley, 2016b, for a
review].
Brain imaging techniques like functional MRI (fMRI) can be used to observe activity in the language network in
healthy individuals in real time. Given its high spatial resolution, fMRI is especially well-suited to study whether any
two cognitive abilities draw on the same brain structures. For example, to ask whether language and mathematical
reasoning recruit the same brain areas, we can have participants perform a language task and a math task while in
an MRI scanner and then test whether brain regions that are active during language processing are also active when
participants solve a math problem. Mirroring the findings from aphasia, this approach reveals that the language network
is extremely selective for language processing: it responds robustly and reliably when people listen to, read, or generate
sentences (Section 2.2.1), but not when they perform arithmetic tasks, engage in logical reasoning, understand computer
programs, listen to music, categorize objects or events, watch others’ actions, reason about people’s mental states, or
process non-verbal communicative information like facial expressions or gestures [e.g., Amalric and Dehaene, 2019,
Benn et al., 2021, Blank et al., 2014, Chen et al., 2021, Deen et al., 2015, Fedorenko et al., 2011, Ivanova et al., 2020,
Jouravlev et al., 2019, Liu et al., 2020, Monti et al., 2007, 2009, 2012, Paunov et al., 2019, 2022, Pritchett
et al., 2018, Shain et al., 2022b].


In summary, evidence from individuals with aphasia and from brain imaging studies is remarkably consistent: the
mechanisms that process language in the human brain do not support non-linguistic cognitive tasks. The latter draw on
distinct mechanisms, as discussed in Section 4 below. This sharp dissociation suggests that in examining language
models’ functionality, it is important to separate their linguistic abilities from their abstract knowledge and reasoning
abilities, which can be probed—and perhaps even learned—through a linguistic interface but require much more than
formal linguistic competence.

3 The success of large language models in acquiring formal linguistic competence


In this section, we evaluate the performance of LLMs qua language models by asking whether these models have
made progress towards achieving formal linguistic competence, the kind of competence that is supported by the
language-selective network in the human brain. We argue that these models are surprisingly and impressively successful
at mastering this specific domain—dramatically more successful than the best systems from even 5-10 years ago. This
impressive advance stands in stark contrast with past claims about the limits of learning language from linguistic input
alone.

3.1 Statistical language models: some fundamentals

LLMs are the spiritual descendants of a number of earlier approaches in computational linguistics, including statistical
language modeling, word embeddings, and connectionism (an earlier term for the approach that morphed into today’s
deep learning). Similar to earlier statistical language models, LLMs are usually trained on a word prediction task (the
same task used for training n-gram models going back to at least Shannon’s work in the mid-20th century; see Jurafsky
and Martin [2009b] and Lee [2003b] for a historical overview). Similar to approaches in distributional semantics and
word embeddings [for overviews, see Baroni and Lenci, 2010, Erk, 2012, Lenci, 2008], LLMs represent linguistic
information as vectors in a high-dimensional space. And, similar to earlier connectionist approaches [e.g., Rumelhart
and McClelland, 1986, 1987, Elman, 1993, 1990], they use neural networks loosely modeled on the human brain,
in which learned connection weights transform the input as it passes through the network to generate a response. All of
these approaches stand in contrast to models that use explicit, structured hierarchical representations of syntactic rules
[see Norvig, 2012, 2017, for a discussion of these two divergent paradigms].
N-grams and word embedding models achieved some success in various domains in natural language processing (e.g.,
spelling correction, spam classification, sentiment analysis; Jurafsky and Martin [2009a], Lee [2003a]). However, they
never approached human-level performance on general language tasks like text generation, leading to claims that purely
statistical approaches would never be able to capture the richness of natural language, particularly in complex syntactic,
morphological, and semantic domains [e.g., Pinker and Prince, 1988]. For example, Everaert et al. [2015] argued
against statistical approaches to understanding human language use. They specifically claim that statistical approaches,
which use linear strings of words as input, are unlikely to learn complex syntactic features that require representing
phrases and sentences hierarchically rather than linearly. This pessimism is now challenged by LLMs.
Here, we focus on a class of LLMs known as transformers. We will use GPT-3, a popular transformer LLM, as an
example to explain how the training process works for these models.
First, a training set is constructed from a massive amount of text from the web. GPT-3, for example, is trained on 45
terabytes of text, or about 500 billion words. The text is broken into word piece tokens, which are either words or
word components. The decision of what tokens to use is based on character co-occurrence information rather than
morphological analysis. As a result, the sentence “GPT-3 can be used for linguistics” is broken into the following word
pieces: G, PT, -, 3, can, be, used, for, lingu, istics. This is typical: short common words like can and used are kept
whole. A novel word in the corpus (GPT-3, which of course did not exist when the model was trained) is broken up into
smaller pieces. And the word linguistics is broken into lingu- (allowing it to learn a potential relationship with related
words like lingual and lingua) and -istics (cladistics, statistics, etc.). GPT-3 has 50k unique word piece tokens.
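For illustration, the brief sketch below shows this kind of subword tokenization using the publicly released GPT-2 tokenizer from the Hugging Face transformers library; GPT-3 uses a closely related byte-pair-encoding vocabulary, so the exact splits may differ slightly from the example above.

# Sketch: subword (byte-pair-encoding) tokenization, using the public GPT-2
# tokenizer as a stand-in for GPT-3's closely related vocabulary.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

sentence = "GPT-3 can be used for linguistics"
tokens = tokenizer.tokenize(sentence)   # subword strings; a leading 'G-with-breve' marks a preceding space
ids = tokenizer.encode(sentence)        # the integer ids the model actually receives

print(tokens)           # e.g., ['G', 'PT', '-', '3', ' can', ' be', ' used', ' for', ' lingu', 'istics']
print(len(tokenizer))   # vocabulary size: roughly 50k word pieces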
During the training, GPT-3 has a simple training objective: predict the next word piece based on a fixed number of
previous word pieces (typically a few hundred). So if the input is “GPT-3 can be used for lingu- ____,” the trained
model might successfully predict that the next word piece is -istics. The predicted word piece is then compared with the
ground truth (which word piece actually occurred in that training sentence), and the feedback signal is propagated back
through the model to update its many (>100 billion) parameters.
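The objective itself can be written in a few lines. The sketch below is a deliberately simplified illustration of a single training step (a toy stand-in, not GPT-3's actual code): the model outputs a score for every vocabulary item at each position, the cross-entropy between those predictions and the word pieces that actually came next serves as the error, and that error is backpropagated to update the parameters.

# Simplified sketch of the next-word-piece prediction objective in PyTorch.
# The toy model below stands in for a >100-billion-parameter transformer and
# only looks at the current token; a real transformer attends over the context.
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),   # token ids -> vectors
    nn.Linear(embed_dim, vocab_size),      # vectors -> scores over the vocabulary
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab_size, (1, 12))   # a toy sequence of 12 word-piece ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # each position's target is the next token

logits = model(inputs)                           # (1, 11, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),              # predicted distribution at each position
    targets.reshape(-1),                         # the word pieces that actually occurred
)
loss.backward()    # propagate the feedback signal back through the model
optimizer.step()   # update the parameters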
GPT-3’s architecture, like that of other transformer models, has a number of key properties that make it so successful.
First, the model contains many layers, each of which can access a mixture of information from earlier layers, allowing
the model to learn both low-level and high-level properties of the input at different layers [Tenney et al., 2019]. Second,
each layer has a gating mechanism — known as attention [Vaswani et al., 2017b] — that allows each word piece node


Figure 2: An example of output produced by GPT-3 in response to a text prompt. The model predicts one token at a
time, iteratively, based on what has been generated so far. The model demonstrates remarkable linguistic competence,
generating syntactically and semantically coherent output.

to selectively attend to any of the preceding word piece nodes. For instance, if some part of the model is responsible for
generating a pronoun, it can learn to selectively attend to earlier possible noun antecedents in order to generate the right
pronoun [Manning et al., 2020, Lakretz et al., 2019, 2022]. Finally, in the transformer architecture, the words in the
input are passed in at the same time (not linearly, word by word, as in past models) and used to generate a prediction all
at once [Vaswani et al., 2017b]. This makes training more parallelizable and efficient, making it feasible to
train these models on the enormous amounts of data required.
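For readers unfamiliar with the mechanism, the sketch below shows the core of scaled dot-product self-attention in a causal (left-to-right) form, where each position can attend only to earlier positions; the learned query/key/value projections and multiple heads of a real transformer layer are omitted for brevity, so this is an illustration of the idea rather than a faithful reimplementation.

# Minimal sketch of causal scaled dot-product self-attention (single head,
# learned projections omitted): each position ends up as a weighted mixture
# of the representations at earlier positions.
import numpy as np

def causal_self_attention(x):
    """x: (seq_len, d) array of word-piece representations."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # relevance of every position to every other
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # disallow attending to future positions
    scores = np.where(mask == 1, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over allowed positions
    return weights @ x

x = np.random.randn(10, 64)              # 10 word pieces, 64-dimensional vectors
print(causal_self_attention(x).shape)    # (10, 64)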
By being trained for word prediction, transformer models learn a lot about the structure of language, including linguistic
features that, even recently, were thought to be beyond the scope of statistical models. These models have succeeded
not just on tests of general language understanding developed by the NLP community [e.g., GLUE tasks Wang et al.,
2018, 2019a], but, critically for our purposes, on tests of linguistic competence. The benchmark BLiMP [Warstadt et al.,
2020], for instance, contains minimal pairs of grammatical vs. ungrammatical sentences testing a diverse range of
difficult linguistic phenomena like filler-gap dependencies (Bert knew what many writers find vs. *Bert knew that many
writers find) and negative polarity licensing (The truck has clearly tipped over. vs. *The truck has ever tipped over.).
These examples are designed to be challenging. As Warstadt and Bowman [2022] summarize, RoBERTa base, another
transformer LLM, achieves human-level performance on 6 out of 12 item types. GPT-3 performs successfully on most
of these items. Similar results are seen on other benchmarks like SyntaxGym [Gauthier et al., 2020], a suite of syntactic
language benchmarks.
Although it is tempting to move the goalposts and focus on what these models are still unable to do (see Bowman
[2022] for a discussion of the dangers of focusing on failures in NLP), we argue that the remarkable advances in LLMs’
ability to capture various linguistic phenomena should not be overlooked.

3.2 What large language models can do: a case study

We demonstrate some of the linguistic capabilities of GPT-3 by working through an example of its output, in response
to a prompt.


Consider the prompt and completion shown in Figure 2. This predicted continuation is typical of GPT-3’s output
quality and illustrative of the features that typify its knowledge of language. The model deftly uses a variety of complex
linguistic features that depend on preceding text. The very first word of the generated text is a pronoun (it), which
requires a prior referent. In this case, the prior referent is GPT-3, and the pronoun is chosen accordingly. The model
also correctly uses the elliptical do so, which refers to can produce text that. . . from the preceding sentence. It uses
consistent and correct passive voice constructions—making sure all verbs have the right number of arguments and that
they are semantically plausible. The adverb purely is placed naturally and sensibly (it is trained purely on, not the more
stilted it purely is trained on), and the prepositions are apt and used with plausible objects (on the statistics, of the
English language).
Further, GPT-3 maintains coherence with the previous sentence by correctly inferring from the reference to English that
the relevant statistics are the statistics of the English language. And it deploys a sophisticated discourse relationship (the
way in which two segments of a text are logically connected, typically across sentences): the first part of the sentence It
can do so is connected, in a sensible way, to the second half of the sentence with the connective even though. Other
connectives like and or because would be less felicitous.
The end of the sentence of syntax, semantics, or even writing reproduces a common English-language pattern for a
series of three parallel items, in which the list begins with two closely related words (in this case, syntax and semantics)
and concludes with a third, less related item (writing). The use of “even” in even writing elegantly marks that pattern.
Beyond just generating text, a widespread paradigm in NLP now involves treating the LLM as a few-shot or zero-shot
learner [Brown et al., 2020] and prompting it to perform tasks of interest, either by giving it a few examples or just by
giving instructions. This approach is known as prompting, and GPT-3 was the first model to successfully perform a
wide range of tasks with prompting alone (without task-specific training).
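As a concrete illustration, a few-shot prompt might look like the following (a made-up example in the style of the translation demonstrations in Brown et al. [2020], not a prompt taken from that paper); the model is simply asked to continue the text, and its continuation is read off as the answer.

# Hypothetical few-shot prompt: the task is specified entirely in text, and the
# model's continuation of the last line is taken as its answer.
prompt = """Translate English to French.
sea otter => loutre de mer
peppermint => menthe poivrée
cheese => fromage
plush giraffe =>"""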
Of course, insofar as the goal of language is to communicate thoughts, GPT-3 is deficient since it has no thoughts
of its own. The lack of underlying meaning is particularly clear when one considers what it means when GPT-3’s
generated text in Figure 2 says that GPT-3 is trained with no knowledge of “writing”. It is hard to interpret that phrase
in a way that makes sense in the world. However, we argue that this failure is not a failure of GPT-3’s formal linguistic
competence, but of its functional competence (see Section 4).
Note that the current (as of January 2023) versions of GPT-3—including “ChatGPT”, “InstructGPT”, and the models
collectively referred to as “GPT-3.5” that descend from “InstructGPT”—are importantly different from the original
version of GPT-3 and its predecessor models in ways that depart from being LLMs in the sense we discuss in this
article. They are trained using at least three kinds of data: (1) purely unsupervised language in a generic next-token
context-conditional prediction task – the original notion of “language modeling”; (2) human reinforcement learning,
with human readers giving thumbs up or down based on whether the system’s responses are good; and (3) human
supervised learning, where humans were asked to write correct responses to a large number of prompts entered into the
GPT-3 playground [OpenAI, 2022a,b]. When we discuss LLMs, we are referring primarily to “pure LLMs” trained on
the first of the above data sources. But, as we discuss in Section 5, the inclusion of these additional objective functions
is a promising avenue for building better models.

3.3 Large language models learn core aspects of human language processing

For LLMs to be useful as models of language processing in humans, we must be convinced that the models encode the
abstract phonological, morphological, syntactic, and semantic rules that characterize human language. While there
are important and interesting differences between syntactic processing in LLMs and humans [Arehalli et al., 2022,
Van Schijndel and Linzen, 2021], there are also important similarities. Here, we review evidence that LLMs learn two
features that are argued by many to be central to human linguistic processing: hierarchical structure and abstraction [see
Hupkes et al., 2020, Lovering and Pavlick, 2022, Linzen and Baroni, 2021, Manning et al., 2020, Press et al., 2022,
Ettinger, 2020a, for discussions of the importance of these features in the context of LLMs]. Both of these features
address primarily the syntactic aspect of formal linguistic competence; we have chosen them because they have been
extensively covered in the linguistics and NLP literatures.

3.3.1 LLMs learn hierarchical structure


In human languages, words combine to make compositional meanings. When a sentence has multiple words, their
meanings do not simply get added linearly one by one. Instead, they can be combined hierarchically [e.g., Adger, 2003,
Bresnan, 1982, Chomsky, 1957, 1965, Frazier, 1979, Jackendoff and Jackendoff, 2002]. For instance, in the phrase
“the keys to the cabinet”, the meaning gets built up as follows: the first “the” combines with “keys”, the second “the”
combines with “cabinet”, “the cabinet” combines with “to”, and the resulting phrase “to the cabinet” combines with
“the keys”.


The hierarchical structure in language manifests in many ways. One prominent example is non-local feature agreement.
In English and many other languages, verbs agree with their subjects. For instance, a plural subject uses the verb “are”,
whereas a singular subject uses “is”. A non-hierarchical bigram model, which simply stores frequencies of two-word
strings, could learn that “The keys are on the table” is more probable than “The keys is on the table” by knowing that
“keys are” is more common than “keys is”. But such a model would not be able to learn that the subject and verb must
agree even if arbitrarily far apart. For instance, “The keys to the old, wooden kitchen cabinet are on the table” has six
intervening words between the subject and verb, and yet “are” still agrees with “keys” and not with the nearby “cabinet”.
However, a model that learns the underlying hierarchical structure of English sentences will have no trouble keeping
track of the subject-verb dependency [Bernardy and Lappin, 2017, Finlayson et al., 2021b, Gulordava et al., 2018,
Kuncoro et al., 2018, Lakretz et al., 2019, Linzen et al., 2016, Lu et al., 2020, Lasri et al., 2022b, Mueller et al., 2022].
We highlight two main methods here that have been used to test whether LLMs learn hierarchical linguistic structure
(although there are many others: see Belinkov et al. [2020] for an overview). The first is to treat the models as
psycholinguistic subjects and see how they handle grammatical tasks that require operating over hierarchical structure
[Futrell et al., 2018, Wilcox et al., 2019, 2021, Linzen et al., 2016, Ettinger, 2020a]. Assessing the ability of models
to correctly do number agreement between subjects and verbs has been a major focus of such work. The task is
straightforward: given that language models produce a probability distribution over all words in their vocabulary, one
can simply measure the probability assigned to the plural verb and compare it to the probability assigned to the singular
verb (e.g., compare the probability of “is” vs. “are” given the prompt “The keys to the cabinet...”).
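This kind of measurement is easy to reproduce with publicly available models. The sketch below uses GPT-2 via the Hugging Face transformers library (a stand-in for the specific models and carefully controlled stimuli used in the studies cited below) to compare the probabilities assigned to the two verb forms.

# Sketch: compare the probability of "are" vs. "is" after a prompt, using the
# public GPT-2 model as a stand-in for the LLMs evaluated in the cited studies.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The keys to the old, wooden kitchen cabinet"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next word piece

for verb in [" are", " is"]:   # leading space: word-initial tokens in GPT-2's vocabulary
    token_id = tokenizer.encode(verb)[0]
    print(verb, probs[token_id].item())
# A model that tracks the hierarchical subject-verb dependency should assign
# higher probability to " are" (agreeing with "keys") than to " is".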
Today’s LLMs perform long-distance number agreement well above chance, even in the presence of intervening words
[Gulordava et al., 2018, Linzen and Baroni, 2021], although they can be distracted by frequency effects [such as
differences in the frequency between the singular and plural forms; Wei et al., 2021, Yu et al., 2020] and (as of GPT-2)
fall short of humans on number agreement tasks in nested sentences (Lakretz et al., 2021; although see Lampinen et al.,
2022). In a similar vein, it has been shown that these models can also handle other structure-sensitive constructions,
like filler-gap dependencies [Wilcox et al., 2018] or negative polarity [Warstadt et al., 2019, Marvin and Linzen, 2018].
Moreover, studies that turn on and off specific model “neurons” have provided mechanistic insight into how an LLM
might perform this task [Lakretz et al., 2019].
Another broad approach, typically known as probing [Tenney et al., 2019, Belinkov and Glass, 2019, Ettinger et al.,
2016, Belinkov, 2022, Pimentel et al., 2020, Lasri et al., 2022a, Pimentel and Cotterell, 2021, Conneau et al., 2017,
Giulianelli et al., 2018], uses an algorithm that takes as input the internal representations of a language model and learns
to map them to some linguistic feature of interest. This approach shows that distances between LLM representations of
individual words in a sentence align with the sentence’s hierarchical structure rather than with linear distance between
the words, thereby recovering the close structural relationship between the subject and verb of a sentence even when
they are linearly far apart [Hewitt and Manning, 2019].
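To give a sense of the general probing recipe (a toy illustration, not the structural probe of Hewitt and Manning [2019] or any other specific published probe), the sketch below trains a linear classifier to recover a linguistic feature, here the grammatical number of the subject, from a model's internal representations.

# Toy probing sketch: can a linear classifier recover a linguistic feature
# (here, the grammatical number of the subject) from a language model's
# hidden states? The four phrases below are placeholder stimuli.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2Model, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def representation(text, layer=6):
    """Hidden state of the final word piece at a chosen intermediate layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    return hidden[0, -1]

phrases = ["The keys to the cabinet", "The key to the cabinets",
           "The dogs in the neighborhood", "The dog in the neighborhood"]
labels = ["plural", "singular", "plural", "singular"]   # number of the subject noun

X = torch.stack([representation(p) for p in phrases]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)   # the linear "probe"
print(probe.score(X, labels))   # real probing studies evaluate on held-out items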

3.3.2 LLMs learn abstractions


Following Ambridge [2020], we define an abstraction as a generalized linguistic representation—such as a part-of-
speech category or grammatical role—that goes beyond simple storage of input and allows for generalization. The very
notion of subject-verb agreement, outlined in the previous section, relies on the abstract categories of subject and verb.
Gulordava et al. [2018] give the example that, in a sentence like “dogs in the neighborhood often... (bark/barks)”,
a model might learn a shallow version of the agreement rule, namely, that the collocation of “dogs” and “bark” in
the same sentence is more common than “dogs” and “barks”. However, a model that has an abstract representation of
categories like grammatical subject, grammatical number, and verb should be able to handle long-distance number
agreement even for novel combinations of words. One way to test a model’s knowledge of abstract rules is
by using semantically nonsensical sentences, like “The colorless green ideas I ate with the chair. . . (sleep/sleeps)”.
Testing on Italian, English, Hebrew, and Russian, Gulordava et al. [2018] found that models performed well even on
these semantically empty sentences.
Further, a large body of work has used the probing methodology described above to test for linguistic abstraction
in LLMs. In this literature, a classifier is typically trained on top of model embeddings to ask whether an abstract
category, such as part-of-speech or dependency role, can be recovered from the model. Tenney et al. [2019] argued
that LLMs “rediscover the classical NLP pipeline,” learning at various layers features like part-of-speech categories,
parses, named entities, and semantic roles. However, one important limitation of such probing studies is that, even if
abstract categories can be decoded from model representations, a model might not necessarily be using this knowledge
[Belinkov, 2022, Hewitt and Liang, 2019, Ivanova et al., 2021a, Pimentel et al., 2020, Voita and Titov, 2020, Elazar
et al., 2021c, Ravfogel et al., 2021, Finlayson et al., 2021a, Wu et al., 2022, Tucker et al., 2022].
One possible objection is that perhaps LLMs learn syntactic information but in a way that is closely tied to the particular
lexical items used (e.g., they might learn the agreement rule for the specific verbs they encountered but not for an


abstract verb category). An even more stringent approach, thus, is to test whether LLMs can apply morphosyntactic rules to
novel words. For instance, Kim and Smolensky [2021] show that BERT has some ability to generalize grammatical
categories. They give the model novel words, used in phrases, as input (e.g., “The blick” where blick is likely a noun and
“They dax” where dax is likely a verb) and test whether, based on the input, the model can generalize the part-of-speech
category. That is, given the example in the preceding sentence as input, the model should be able to know that “I went
to a blick” is more probable than “I went to a dax” since blick was used as a noun. They conclude that BERT succeeds
partially at this task: it does learn to generalize, but only after repeated examples [but see Kim et al., 2022, Misra et al.,
2022, for ways in which the word itself affects compositional ability]. More recent models, such as GPT-3, seem to
be able to use a novel word appropriately right away, at least if prompted correctly [Brown et al., 2020, McCoy et al.,
2021b].
A probing study in the same vein [Maudslay and Cotterell, 2021] attempted to recover hierarchical sentence structure
from LLM representations of sentences constructed in the style of the poem Jabberwocky, such that they are syntactically
English-like but use nonce words in place of content words (e.g., “I povicated your briticists very much”). Even in
this difficult case, probes performed above chance, although the performance was lower than for meaningful sentences.
These results suggest that, although language models may rely on lexical-semantic cues to some extent, linguistic
probes recover some abstract syntactic knowledge even in a very stringent test.
Note that a human-like language model is not expected to rely solely on abstract rules. Humans use diverse cues in
their language learning and processing that sometimes override or conflict with strict hierarchical syntactic processing
[e.g., MacDonald et al., 1994a, 1992, Seidenberg and MacDonald, 1999, MacDonald, 1993a, MacDonald et al.,
1994b, MacDonald, 2013, 1993b, Bates and MacWhinney, 1989a, 1987, 1989b, MacWhinney and MacWhinney, 1987,
Tanenhaus et al., 1995, Trueswell et al., 1993, Seidenberg et al., 1982, Trueswell and Tanenhaus, 1994, Altmann and
Kamide, 1999, Altmann and Steedman, 1988, Altmann and Kamide, 2007, Gibson and Pearlmutter, 1998, Rayner
and Frazier, 1989, Rayner et al., 2006]. This resource-constrained, “good enough” processing strategy [Ferreira et al.,
2002] leads to frequent syntactic errors, including in non-local subject-verb agreement cases discussed above [Bock and
Miller, 1991, Patson and Husband, 2016, Brehm and Bock, 2013, Christianson, 2016]. Humans also rely, to varying
extents, on memorizing previously seen input, as opposed to purely learning abstract rules [Ambridge, 2020, Bod, 2009,
Bybee and Hopper, 2001b, Goldberg, 2019, O’Donnell, 2015, Langacker, 1988, 2010, Goldberg, 2009].
Overall, it seems clear that LLMs achieve at least some degree of abstraction. The degree of that abstraction remains a
matter of debate, as it does for humans, but the fact that LLMs today already show evidence of representing hierarchical
structure and abstract linguistic patterns suggests that it is feasible to build powerful language models that learn linguistic
rules from textual input.

3.4 LLMs resemble the human language-selective network

Multiple studies have shown that LLMs can be used to predict activity in the language brain network in response to
novel sentences [e.g., Caucheteux and King, 2022, Goldstein et al., 2022, Schrimpf et al., 2021]. This correspondence
between artificial and biological neural networks for language suggests that perhaps these systems carry out similar
functions. A likely candidate objective for both is next-word prediction in the service of meaning extraction. Indeed,
Schrimpf et al. [2021] showed that LLMs that perform better on next-word prediction (but not other language tasks)
provide a better match to human behavioral and neural data (see, e.g., Wilcox et al., 2021 for additional behavioral
evidence).
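The standard analysis in this literature is an “encoding model”: a regularized linear regression from model-derived representations of the stimuli to the recorded brain responses, evaluated on held-out stimuli. The sketch below illustrates the logic with synthetic arrays standing in for sentence embeddings and fMRI data; the cited studies use real recordings, proper cross-validation, and noise-ceiling corrections that are omitted here.

# Encoding-model sketch: predict brain responses to sentences from LLM-derived
# sentence representations. Synthetic arrays stand in for real embeddings and fMRI data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_sentences, n_features, n_voxels = 200, 768, 50

llm_embeddings = rng.standard_normal((n_sentences, n_features))    # e.g., one vector per sentence
true_weights = rng.standard_normal((n_features, n_voxels))
brain_responses = llm_embeddings @ true_weights + 5.0 * rng.standard_normal((n_sentences, n_voxels))

train, test = slice(0, 150), slice(150, 200)
encoder = Ridge(alpha=10.0).fit(llm_embeddings[train], brain_responses[train])
predicted = encoder.predict(llm_embeddings[test])

# Score: per-voxel correlation between predicted and observed held-out responses.
r = [np.corrcoef(predicted[:, v], brain_responses[test][:, v])[0, 1] for v in range(n_voxels)]
print(np.mean(r))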
Furthermore, the functional response properties of LLMs described above (Section 3.3) resemble those of the language
network. Similar to LLMs, the language network is sensitive to abstract hierarchical rules, as evidenced by studies
that use isolated phrases and sentences [e.g., Snijders et al., 2009, Fedorenko et al., 2010, 2016, Pallier et al., 2011,
Ding et al., 2016, Law and Pylkkänen, 2021, Nelson et al., 2017, Shain et al., 2021], naturalistic narratives [e.g., Shain
et al., 2020, Brennan et al., 2020, Heilbron et al., 2022, Reddy and Wehbe, 2021, Shain et al., 2022a], and syntactically
well-formed but semantically empty ("jabberwocky") stimuli [e.g., Fedorenko et al., 2010, Fedorenko and Varley,
2016b, Shain et al., 2021, Pallier et al., 2011, Humphries et al., 2006, Matchin et al., 2017, 2019, Matchin and Wood,
2020]. However, just like LLMs, the language network is also sensitive to specific word co-occurrences [e.g., as
evidenced by sensitivity to n-gram surprisal; Shain et al., 2020].
Finally, the internal architecture of LLMs and that of the language network also show a resemblance: neither system shows
clear spatial segregation for syntactic and semantic processing (LLMs: e.g., Tenney et al., 2019, Durrani et al., 2020,
Huang et al., 2021; brain: e.g., Dick et al., 2001, Fedorenko et al., 2020, Reddy and Wehbe, 2021), indicating that these
functions are tightly functionally coupled in both.


3.5 Limitations of LLMs as human-like language learners and processors

Although a preponderance of evidence suggests that LLMs learn some aspects of hierarchical structure and abstraction
and resemble the human language system in other ways, their behavior is not fully human-like. Below we consider three
common criticisms of LLMs as models of human language processing and provide some responses to these objections.

3.5.1 Excessive reliance on statistical regularities

Part of what makes LLMs succeed on language tasks is almost certainly the fact that they pick up on statistical
regularities to achieve good performance without necessarily learning what we think of as the relevant linguistic
information (e.g., hierarchical structure, abstract grammatical categories, etc.). In other words, the models can be "right
for the wrong reason" [McCoy et al., 2019] and leverage certain features in the input that aren’t the ones being tested.
For instance, adding noise or distracting information can degrade model performance on a variety of tasks [e.g. Belinkov
and Bisk, 2017, Khayrallah and Koehn, 2018, Wallace et al., 2019, Kassner and Schütze, 2020]. In some (but not all) of
these cases, such noise has been shown not to affect humans in the same way. Chaves [2020] makes an argument along these
lines, suggesting that the models are actually learning surface statistical regularities in order to model the probabilities
of various syntactic phenomena. Other evidence suggests that LLMs can be misled by simple frequency effects, such as,
in a task where the model has to choose between the singular and plural forms of a particular verb, always choosing the plural form
of verbs for which the plural form is much more frequent [Chaves and Richter, 2021].
These findings lead to the question: do LLMs simply store and regurgitate output? This does not appear to be the
case: McCoy et al. [2021a] explicitly explored the extent to which GPT-2 output is regurgitated from the training set
and found that, although n-grams up to length 4 often appeared in the training set, GPT-2 generated mostly novel
5-grams and above. They also showed that the model routinely generates plausible novel words that do not appear in
the training set. Thus, LLMs generate output based on a combination of word co-occurrence knowledge and abstract
morphosyntactic rules; whether this ratio can be adjusted to match that of humans (which is still debated; Section 3.3.2)
remains to be determined.
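The spirit of this check can be conveyed with a simple n-gram overlap count. The sketch below is a toy version with stand-in strings (McCoy et al. [2021a] work at corpus scale with additional controls): it reports what fraction of the n-grams in a generated text already occur in the training text.

# Toy n-gram novelty check: what fraction of the n-grams in generated text
# already occur in the training text? The two strings are stand-ins for a
# training corpus and model output.
def ngrams(words, n):
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

training_text = "the keys to the cabinet are on the table in the kitchen".split()
generated_text = "the keys to the old cabinet are in the kitchen".split()

for n in range(2, 6):
    generated = ngrams(generated_text, n)
    copied = generated & ngrams(training_text, n)
    print(f"{n}-grams also found in training text: {len(copied)}/{len(generated)}")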

3.5.2 Unrealistic amounts of training data

Most LLMs that achieve near-human performance are trained on vastly more data than a child is exposed to. For
example, van Schijndel et al. [2019] have found that a model’s training dataset would need to be unrealistically large in
order to handle some constructions in a human-like way. Warstadt and Bowman [2022] estimate that GPT-3 sees 1000x
more language data than a 10-year-old human (but observe that RoBERTa sees only 100x more and still performs well
at grammaticality tasks). Therefore, even if linguistic information is, in principle, learnable from statistical regularities
in the input, in practice, human language learners likely rely on pre-existing biases in order to learn quickly from sparse
and noisy input—biases that today’s state-of-the-art models lack. This difference in the amount of “training data” that
models vs. human language learners require is sometimes taken to imply that the resulting model representations will
necessarily be fundamentally unlike human linguistic representations [see, e.g., Munn, 1950, Meister, 2022, for similar
discussions in the context of using non-human animals as models of human cognition].
However, there is reason for optimism. Ongoing work is actively exploring the extent to which models that are trained
on more realistic amounts and kinds of input and/or in otherwise more realistic ways can still learn critical aspects of
language [Warstadt and Bowman, 2022]. Several studies have found that some syntactic generalizations can occur
in BERT-like architectures that are trained with millions (as opposed to billions) of words [Hu et al., 2020, Zhang
et al., 2021]. For example, the ELECTRA model uses a binary word prediction objective, instead of an unconstrained
word prediction task, and achieves comparable performance to models trained on far more data [Clark et al., 2020].
BabyBERTa, a model trained on 5 million words of child-directed speech [Huebner et al., 2021]—similar in scale to what
a human child would encounter—learned some syntactic generalizations comparable to RoBERTa, a high-performing
model that is trained on 30 billion words.
Performance of smaller language models is far from perfect; for example, BabyBERTa could not represent several aspects
of grammar that human children find intuitive. Critically, however, improvements in language models—including the
use of more cognitively-inspired architectures and more realistic learning scenarios—could lead to strong performance
with orders of magnitude less training data than today’s state-of-the-art LLMs [see Zhuang et al., 2021, for evidence
for the success of a similar approach in vision models]. In that vein, an important question is what inductive biases are
introduced by the model architectures, whether those biases resemble the ones that enable humans to learn language
[McCoy et al., 2018, 2020, Ravfogel et al., 2019], and whether better architectures could better capture these biases and
enable faster learning on less data.

3.5.3 Insufficient tests on languages other than English


Because LLMs are data-hungry, they only work well on languages for which vast corpora are available. For most human
languages, this is not the case. More worryingly, the architectures themselves may be biased towards English and other
European languages [Blasi et al., 2022]: not all languages are equally easy to model given the existing infrastructure
[Mielke et al., 2019, Cotterell et al., 2018, Ravfogel et al., 2018]. Thus, we should proceed with caution in assuming
that the success of BERT, GPT-2/3, and other LLMs will extend to all languages. That said, evidence is growing of
strong performance in a variety of languages [Martin et al., 2020, Wang et al., 2019b, Pires et al., 2019], and successful
transfer of models to low-resource languages [Wang et al., 2020].

3.6 Interim Conclusions

LLMs today generate highly coherent, grammatical texts that can be indistinguishable from human output. In doing
so, they exhibit at least some knowledge of hierarchical structure and abstract linguistic categories while successfully
capturing human brain responses during language processing. These models are not perfect learners of abstract linguistic
rules, but neither are humans. We therefore conclude that LLMs are on track to acquiring formal linguistic competence.
LLMs have already overturned the claims about the fundamental impossibility of acquiring certain linguistic knowl-
edge—including hierarchical structure and abstract categories—from the input alone [but see Everaert et al., 2015]. If
language modeling continues to improve (including learning from more realistic kinds and amounts of data; Section
3.5.2), this would allow testing more general versions of the poverty of the stimulus argument [Berwick et al., 2011a,
Chomsky, 1991, Pullum and Scholz, 2002, McCoy et al., 2018], including specific tests of which, if any, inductive
biases are required to learn the rules and statistical regularities of human language. As such, LLMs have substantial
value in the scientific study of language learning and processing.

4 The failure of large language models in acquiring functional linguistic competence


4.1 LLMs are great at pretending to think

Large text corpora contain a wealth of non-linguistic information, from mathematical and scientific facts (e.g., “two
plus seven is nine”) to factual knowledge (e.g., “the capital of Texas is Austin”) to harmful stereotypes (e.g., “women
belong in the kitchen”). This is not particularly surprising since even simple patterns of co-occurrence between words
capture rich conceptual knowledge, including object properties [e.g., Grand et al., 2022, Huebner and Willits, 2018,
Unger and Fisher, 2021, Utsumi, 2020, van Paridon et al., 2021], abstract analogies [Ichien et al., 2021], social biases
[e.g., Bolukbasi et al., 2016, Caliskan et al., 2017, Lewis and Lupyan, 2020], and expert knowledge in specialized
domains [e.g., Tshitoyan et al., 2019]. Moreover, statistical regularities extracted from language and from visual scenes
exhibit a substantial degree of correspondence [Roads and Love, 2020, Sorscher et al., 2021], indicating that linguistic
information can capture at least some aspects of experiential input [e.g., Abdou et al., 2021, Patel and Pavlick, 2021].
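
As an illustration of how much conceptual structure plain co-occurrence statistics carry, the sketch below projects off-the-shelf word vectors onto a "size" axis, a simplified version of the semantic projection method of Grand et al. [2022]; it assumes the gensim library and its downloadable GloVe vectors, and the anchor and probe words are illustrative choices.

```python
# Simplified sketch of semantic projection [Grand et al., 2022]: score words along
# a "size" axis built from co-occurrence-based vectors. Assumes the gensim library;
# anchor words and probe words are illustrative choices.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained GloVe vectors (downloads once)

large = np.mean([vectors[w] for w in ("large", "big", "huge")], axis=0)
small = np.mean([vectors[w] for w in ("small", "little", "tiny")], axis=0)
size_axis = (large - small) / np.linalg.norm(large - small)

for word in ("whale", "elephant", "dog", "mouse", "ant"):
    print(f"{word:10s} {float(np.dot(vectors[word], size_axis)):+.2f}")
# If size information is recoverable from co-occurrence alone, larger animals
# should tend to receive higher scores.
```
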
As a result, language models trained on gigantic text corpora acquire large amounts of factual knowledge [e.g.,
Kassner and Schütze, 2020, Elazar et al., 2021b, Ettinger, 2020b, Petroni et al., 2019, 2021], succeed at some types
of mathematical reasoning [e.g., Lewkowycz et al., 2022, Rae and Razavi, 2020] and reproduce many stereotypes
and social biases [Kurita et al., 2019, Kaneko et al., 2022, Wei et al., 2022a]. All these behaviors — both positive and negative — become more prominent as models get larger [Rae et al., 2021, Srivastava et al., 2022], indicating that
larger storage capacity allows LLMs to learn increasingly more fine-grained patterns in the input.
These wide-ranging capabilities are, in a way, unsurprising given LLMs’ impressive linguistic skills that we covered in
Section 3. The very same pattern learning mechanisms enable LLMs to infer that “keys” go with “are”, not “is”, and
that “The capital of Texas” is “Austin”, not “Boston”. But do LLMs actually use this information to reason about the
world? How well do they generalize beyond specific word co-occurrences? Can they use their knowledge appropriately
depending on context?
In the rest of this section, we argue that LLMs, in fact, struggle with non-language-specific capabilities. If a test scenario
is common enough, then the model can seemingly succeed — but studies that test generalized reasoning abilities
highlight major limitations of LLMs as models of human cognition.

4.2 How LLMs fail

Any test of LLMs’ ability to reason must account for their ability to use word co-occurrence patterns to “hack” the task.
In some cases, researchers can circumvent this hacking behavior by constructing unusual prompts, ensuring that an
answer cannot be found in (or easily constructed from) the models' training data. Consider the study by Collins, Wong et al. (2022). The authors asked both GPT-3 and human participants to come up with solutions for a set of unusual tasks; an independent group of human raters then evaluated the goodness of the proposed solutions. A representative
example is below:
Prompt: Get your sofa onto the roof of your house.
GPT-3 response: I would start by getting a very strong ladder and a very strong friend...
(goodness: 3.6 out of 7)
Human response: You may need to rent a Genie lift large enough to carry the sofa. You will need
at least one other person... (goodness: 4.8 out of 7)

To manipulate how “out-of-distribution” the query is, the prompts get progressively more constrained:
Prompt: Get your sofa onto the roof of your house, without using a pulley.
GPT-3 response: Use a rope to tie around the sofa and connect it to a car. (goodness: 3.0 out of 7)
Human response: I would get a giant crane... and use the crane to lift it to the roof of my
house. (goodness: 5.1 out of 7)

With more and more constraints, human responses remain at approximately the same level of goodness, whereas model
performance breaks down completely:
Prompt: Get your sofa onto the roof of your house, without using a pulley, a ladder, a crane...
GPT-3 response: Cut the bottom of the sofa so that it would fit through the window...break the
windows to make room for the sofa. (goodness: 2.7 out of 7)
Human response: I will build a large wooden ramp...on the side of my house with platforms every
5 feet... (goodness: 5.0 out of 7)

GPT-3’s failure to perform well on out-of-distribution problems indicates that LLMs can leverage existing patterns
in the data but struggle to come up with creative solutions to novel tasks. Many other studies also highlight LLMs’
limitations when it comes to reasoning tasks that cannot be solved with shallow heuristics [for a summary, see, e.g., Helwe et al., 2021]. Note that the formal linguistic competence of GPT-3 remains high: its output is grammatical and coherent. Only when we attempt to evaluate the merit of the proposed solution do its shortcomings become apparent.

4.3 Limitations of LLMs as real-life language users

Why should we care whether LLMs can think? One reason, of course, is a wealth of claims that contemporary LLMs
are precursors to artificial general intelligence [AGI; e.g., Romero, 2022, Yan, 2021]. However, there is another reason,
which is of substantial relevance to both AI researchers and language scientists: real-life language use is impossible
without non-linguistic cognitive skills. Understanding a sentence, reasoning about its implications, and deciding what
to say—these skills all rely on cognitive capacities that go way beyond lexical semantics or syntax.
We focus on four key capacities that are not language-specific but are nevertheless crucial for language use in real-life
settings: i) formal reasoning—a host of abilities including logical reasoning, mathematical reasoning, relational
reasoning, computational thinking, and novel problem solving; ii) world knowledge—knowledge of objects and their
properties, actions, events, social agents, facts, and ideas; iii) situation modeling—the dynamic tracking of protagonists,
locations, and events as a narrative/conversation unfolds over time; and iv) social reasoning—understanding the social
context of linguistic exchanges, including what knowledge is shared, or in ‘common ground’, what the mental states of
conversation participants are, and pragmatic reasoning ability. A simple conversation typically requires the use of all
four of these capacities, yet none of them are specific to language use. Below, we provide evidence that these skills rely
on non-language-specific processing mechanisms in humans and highlight LLMs’ failures as relevant to each domain.

4.3.1 Formal reasoning

Language allows us to discuss highly abstract ideas, turn ideas into scientific and philosophical theories, construct
logical syllogisms, and engage in formal debates. Unsurprisingly, language is often considered a cornerstone of complex
reasoning [e.g., Baldo et al., 2005, 2010, Dennett, 1994, Carruthers, 2002, Grigoroglou and Ganea, 2022, Hinzen, 2013].

However, both neuroscience and studies of LLMs provide evidence that language and formal reasoning dissociate in
cognitive systems.
Humans. Despite their close interplay, language and reasoning rely on distinct cognitive and neural systems. A line of
work by Monti and colleagues [e.g., Monti et al., 2007, 2009, Coetzee and Monti, 2018] has investigated the brain basis
of logical inference in a task where the participants had to decide whether the conclusion followed from the premise ("If
X, then Y" => "If Y, then not X”). They found that, when contrasted with grammaticality judgments, logical inference
recruited a set of frontal and parietal brain regions that are distinct from the language network. These brain regions are
known as the multiple demand network [Duncan, 2010, 2013, Duncan et al., 2020], named so because they support
many different kinds of demanding cognitive tasks, including logic, mathematical reasoning [Amalric and Dehaene,
2016, 2019, Pinel and Dehaene, 2009, Fedorenko et al., 2013, Monti et al., 2012], physical reasoning [Fischer et al.,
2016, Schwettmann et al., 2019, Pramod et al., 2022], and computer code comprehension [Ivanova et al., 2020, Liu
et al., 2020]. Human patient studies have provided causal evidence for the role of the multiple demand network in
logical reasoning by showing that the amount of damage to these regions correlates negatively with performance on
executive function tasks [Gläscher et al., 2010, Woolgar et al., 2010, 2018]. Importantly, the multiple demand network
supports reasoning even when the task is presented linguistically [Amalric and Dehaene, 2016, 2019, Ivanova et al.,
2020, Monti et al., 2012] — similar to how LLMs receive their prompts.
LLMs. Several studies have questioned the extent to which language models can engage in tasks that require formal reasoning, such as math problems expressed in words. Patel et al. [2021] showed that, while models can appear to solve
math problems [e.g., the dataset in Miao et al., 2020], they actually rely on heuristics and fail on more complicated
problems. Similarly, the creators of GPT-3 show that it performs well on two-digit addition and subtraction but not on
more complex tasks, such as three-digit addition or two-digit multiplication [Brown et al., 2020]. Reasoning tests that
break common co-occurrence patterns in the input or require multi-step operations also lead to model failure [Chowdhery et al., 2022, Talmor et al., 2020b]. To understand why, Zhang et al. [2022] evaluate a BERT model trained from scratch on a controlled dataset where classical logic rules hold in all cases. They show that the model behaves near-perfectly on novel examples sampled from within the training distribution but fails to generalize to out-of-distribution examples that would be easy to solve with simple logic rules [for related examples, see Lake and Baroni, 2018, Loula et al., 2018].3
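
A sketch of the kind of probe used in such evaluations is shown below; query_model is a hypothetical stand-in for whatever completion interface the model under test exposes, and the prompt format is illustrative.

```python
# Sketch of a digit-length generalization probe for addition, in the spirit of the
# evaluations discussed above. `query_model` is a hypothetical placeholder: swap in
# a call to the model being tested.
import random

def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a call to the model under test")

def addition_accuracy(n_digits: int, n_trials: int = 50) -> float:
    correct = 0
    for _ in range(n_trials):
        a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        answer = query_model(f"Q: What is {a} + {b}?\nA:").strip()
        correct += answer.startswith(str(a + b))
    return correct / n_trials

# A model that has learned an addition algorithm should degrade gracefully with
# n_digits; a model relying on memorized co-occurrence patterns typically does not.
# for n in (2, 3, 5, 10):
#     print(n, addition_accuracy(n))
```
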
Overall, evidence from LLMs is consistent with evidence from neuroscience: language and formal reasoning are distinct cognitive capacities relying on distinct processing mechanisms.

3
A potentially encouraging direction is to train/fine-tune language models on the intermediate computations required to arrive at the correct answer, the so-called "mental scratchpad" [Nye et al., 2021, Recchia, 2021]. However, because the content of the mental scratchpad in these examples is not linguistic, we consider these models to be outside the scope of this paper. Further, these models still suffer from the generalization issues we describe above; for instance, models trained to do addition on numbers with 1-8 digits fail to generalize to numbers with 10 or more digits.

4.3.2 World knowledge and commonsense reasoning

After people access the meanings of words and compose them into a semantically coherent whole, they get access to a
wealth of conceptual information. For instance, after reading the sentence “Nabiha took the keys from the table”, we
can access a wealth of implicit information about that event (what size the keys likely are; what the motion of taking the
keys from the table looks like; where the keys are now with respect to Nabiha). Should we expect LLMs to perform the
same inferences? Is language knowledge inextricably tied to world knowledge?
Humans. Evidence from neuroscience shows a dissociation between linguistic and semantic knowledge. On one
hand, individuals with aphasia struggle to produce grammatical utterances and retrieve contextually appropriate words,
but their ability to reason about objects and events presented as pictures often remains intact [Varley and Siegal,
2000, Ivanova et al., 2021b, Benn et al., 2021]. On the other hand, individuals who suffer from semantic dementia,
a neurodegenerative disorder affecting primarily anterior temporal lobes, struggle with world knowledge tasks (e.g.,
knowing that a zebra has stripes) regardless of whether the stimuli are presented as words or as pictures [Patterson et al.,
2007]. Thus, despite a tight coupling between language and world knowledge that is required for using language in real
life, they rely on distinct neural circuits.
LLMs. As discussed in Section 4.1, language models acquire a wealth of world knowledge contained in word co-
occurrence patterns. However, world knowledge in LLMs suffers from two major shortcomings. First, it is brittle.
Kassner and Schütze [2020] show that BERT can be tricked by prepending a distractor prime to a query, a technique they call "mispriming". For instance, performance might degrade if we prepend something like "Boston?" to the query "The capital of Texas is ____." Now, instead of producing Austin, the model might supply the contextually salient Boston.
Moreover, LLMs often generate inconsistent outputs. For instance, when prompted with "Seinfeld premiered on ____" and "Seinfeld originally aired on ____", they might provide names of different TV networks, despite the fact that the two prompts have the same semantic content [Elazar et al., 2021b; see also Ravichander et al., 2020].
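
The brittleness described above is easy to probe with off-the-shelf tools. The sketch below scores two candidate completions with and without a mispriming prefix; Kassner and Schütze [2020] studied masked models such as BERT, so GPT-2's left-to-right scoring is used here purely for illustration, assuming the HuggingFace transformers library.

```python
# Illustrative mispriming probe: compare GPT-2's log-probability of two candidate
# continuations with and without a distractor prefix. Assumes the HuggingFace
# `transformers` library; Kassner and Schütze [2020] used masked LMs instead.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = torch.log_softmax(model(ids).logits, dim=-1)
    # sum the log-probabilities of the continuation tokens, each conditioned on its prefix
    return sum(logprobs[0, i - 1, ids[0, i]].item() for i in range(prompt_len, ids.shape[1]))

for prompt in ("The capital of Texas is", "Boston? The capital of Texas is"):
    scores = {city: continuation_logprob(prompt, city) for city in (" Austin", " Boston")}
    print(prompt, {c: round(s, 2) for c, s in scores.items()})
# If the distractor prefix shifts preference toward "Boston", the model's factual
# "knowledge" is being overridden by surface-level priming.
```
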
Misra et al. [2022] show another interesting failure of even advanced models like GPT-3 on a task where they are asked
to reason about properties of novel objects. Models generally learn properties of objects, like that robins can fly and
penguins cannot. They can also seemingly generalize to novel objects, correctly giving high scores to “A wug is a robin.
Therefore, a wug can fly." But this knowledge is brittle: because of the distractor premise, even highly performant models fail on "A wug is a robin. A dax is a penguin. Therefore, a dax can fly."
Second, world knowledge in LLMs is biased. Gordon and Van Durme [2013] note that learning about the world
from language corpora is challenging because much of world knowledge is implied: people are much more likely to
communicate new or unusual information rather than commonly known facts. Indeed, language models have impaired
knowledge of domains that are underreported, such as basic shape knowledge [e.g., "the wheels are round"; Lucy and Gauthier, 2017, Utsumi, 2020, Chersoni et al., 2021] and object size knowledge [e.g., "a table is smaller than an airplane"; Talmor et al., 2020a, Liu et al., 2022b]. These shortcomings arise from the fact that these models are trained
to extract statistical information about words in text rather than a set of stable, consistent, and complete facts about the
world. Any output generated by LLMs will be biased accordingly.
The world knowledge gaps become particularly prominent when it comes to commonsense reasoning tasks, which often
require using information that is not explicitly stated in the training data. A number of datasets have been developed for
testing commonsense reasoning in language models [Huang et al., 2019, Levesque et al., 2012, Zellers et al., 2018], and, while models can succeed in some cases, they fail on more rigorous tests. For example, Elazar et al. [2021a] show that models cannot generalize on Winograd schema tasks [Levesque et al., 2012] that require commonsense reasoning
about the world. For instance, when presented with a sentence like “The lawyer asked the witness a question, but he
was reluctant to repeat it.”, most humans would agree that “he” refers to “lawyer” and not “witness”, but the model
would perform at chance. What is worse, Elazar and colleagues show that some past studies reporting model success
on commonsense reasoning rely on flawed evaluation techniques, which means that, going forward, we should be
exceedingly careful when distinguishing between statistical knowledge and true world knowledge in LLMs.

4.3.3 Situation modeling


People can easily follow the plot of a story that spans multiple chapters or, sometimes, multiple book volumes. We can
also have a three-hour conversation with a friend, and the next day the friend will expect us to remember most of what
was said. We accomplish this impressive feat not by having a dedicated memory slot for every word that we read or
heard, but by abstracting away linguistic information into a situation model — a mental model of entities, relations
between them, and a sequence of states they had been in [Van Dijk et al., 1983]. In addition to tracking long contexts, a
situation model allows us to seamlessly integrate linguistic and non-linguistic information, as in a sentence “Can you
pass me that? <points at a plate>” [Jackendoff, 2002]. Given the importance of situation modeling for language use,
should we treat it as part of formal linguistic competence?
Humans. As discussed in Section 2.2.1, the language network is sensitive to linguistic regularities that apply to units of
different ‘grain size’, from phonemes to morphemes to words to phrases and clauses. Curiously, it does not appear
to be sensitive to structure above the clause level [e.g., Blank and Fedorenko, 2020, Jacoby and Fedorenko, 2020, Lerner et al., 2011; see also Yeshurun et al., 2017]. This response profile suggests that the language network is distinct
from downstream processes that aggregate phrase- and sentence-level meanings into a coherent whole. The fact that
people’s memory for precise linguistic forms is relatively poor [e.g., Gurevich et al., 2010, Potter and Lombardi, 1998]
aligns well with the idea that the representations passed on by the language system to downstream systems are abstract and semantic in nature. A likely candidate for integrating these abstract representations is the so-called default network [e.g., Blank and Fedorenko, 2020, Ferstl and von Cramon, 2002, Kuperberg et al., 2000, Lerner et al., 2011, Simony et al., 2016; for a general review of the default network, see Buckner and DiNicola, 2019]. Crucially, the default network
builds situation models for both linguistic and non-linguistic narratives [Baldassano et al., 2017, 2018], indicating that
situation modeling is not a language-specific skill.4

4
Recent work has suggested that the default network may consist of two distinct interdigitated sub-networks [Braga et al., 2019, Deen and Freiwald, 2022, DiNicola et al., 2020]. One of these sub-networks appears to correspond to the Theory of Mind network [Saxe and Kanwisher, 2003] discussed in Section 4.3.4 below. The exact contributions of the other sub-network remain debated, with different proposals linking its functions to episodic projection [placing oneself into the past, when remembering things, or into the future, when imagining things; Buckner et al., 2008], scene construction and situation modeling [Hassabis and Maguire, 2009], or spatial cognition in general [Deen and Freiwald, 2022].

LLMs. Although there is some evidence that LLMs track internal representations of the entities’ states [Li et al., 2021],
they are, by design, incapable of tracking information over long contexts. As of today, their input window is typically between 512 and 2,048 tokens — nowhere near enough to represent a full conversation, let alone a book (models with "long" context windows can cover a book chapter; e.g., Beltagy et al. [2020]; Guo et al. [2021]). Furthermore, LLMs struggle to accurately build situation models even over short spans of text: for instance, their outputs can refer to non-existent discourse entities ["Arthur doesn't own a dog. The dog is brown."; Schuster and Linzen, 2022]. And, of
course, models trained exclusively on text strings are, by design, incapable of using this knowledge to refer to real-world
entities [Bender et al., 2021, Bender and Koller, 2020b, Bisk et al., 2020], meaning that they are incapable of using
language in a physical environment the way humans do. Thus, LLMs in their current form are challenged to perform
an essential feat of language comprehension: integrating incoming language information into a general, multimodal,
dynamically evolving situation model.
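
To put these numbers in perspective, a rough calculation helps; the speech-rate and tokens-per-word figures below are order-of-magnitude assumptions, not measurements.

```python
# Rough comparison of a 2,048-token context window with everyday language use.
# The speech rate and tokens-per-word figures are order-of-magnitude assumptions.
context_window = 2_048

words_per_minute = 150      # assumed conversational speech rate
tokens_per_word = 1.3       # assumed average for English subword tokenizers

conversation_tokens = 3 * 60 * words_per_minute * tokens_per_word   # a 3-hour conversation
book_tokens = 100_000 * tokens_per_word                             # a ~100k-word book

print(f"3-hour conversation: ~{conversation_tokens:,.0f} tokens "
      f"(~{conversation_tokens / context_window:.0f}x the window)")
print(f"book:                ~{book_tokens:,.0f} tokens "
      f"(~{book_tokens / context_window:.0f}x the window)")
```
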

4.3.4 Social reasoning (pragmatics and intent)

“Water!”
Wittgenstein famously used single-word utterances like this to show that linguistic meaning radically depends on
context. Although its literal interpretation is simply a reference to a physical entity, the intended meanings are much
more varied. Is the word being gasped by a thirsty person in the desert? By a hiker warning his friend of a hidden
stream? An impatient diner talking to a waiter? Work in cognitive science and linguistics has come to recognize that
these kinds of grounded, context-dependent aspects of language are not just peripheral but a central part of human
language production and understanding [Bloom, 2002, Clark, 1996, Frank and Goodman, 2012, Grice, 1975, 1969,
Slobin, 1996, Wilson and Sperber, 2002, Tomasello, 2010, Christiansen and Chater, 2016].
The process of inferring the intended meaning of an utterance beyond its literal content is known as pragmatics [Levinson et al., 1983]. This process likely engages a variety of cognitive mechanisms [Andrés-Roqueta and Katsos,
2017, Levinson, 2000, Paunov et al., 2022]. Here, we focus on one core capacity required for pragmatics: social
reasoning.
Humans. A wealth of neuroscientific evidence shows that the human brain has dedicated machinery for processing
social information [e.g., Adolphs, 1999, 2009, Deen et al., 2015, Isik et al., 2017, Kanwisher et al., 1997, Lee Masson and Isik, 2021, Saxe, 2006, Tarhan and Konkle, 2020, Walbrin et al., 2018]. Perhaps the most relevant to our current
discussion is the theory of mind network, a set of brain regions that are engaged when their owner is attempting to
infer somebody’s mental state [Fletcher et al., 1995, Gallagher et al., 2000, Jacoby et al., 2016, Saxe and Kanwisher,
2003, Saxe et al., 2006, Saxe and Powell, 2006]. The specific contributions of the theory of mind network to language
understanding can be divided into two broad categories. First, just like other functionally specialized brain modules, it
is engaged when processing semantic content that is specifically related to its domain (see Section 4.3.2): narratives that
require inferring the mental state of the characters engage the theory of mind network regardless of whether the actual
stimuli are presented as texts or as movies [Jacoby et al., 2016, Paunov et al., 2022], and texts that require inferring the characters'
intentions evoke greater activity than those that do not [Fletcher et al., 1995, Ferstl and von Cramon, 2002, Saxe and
Powell, 2006]. Second, the theory of mind network is engaged more strongly during nonliteral language comprehension,
such as jokes, sarcasm, indirect speech, and conversational implicature [Hauptman et al., 2022, Spotorno et al., 2012,
Feng et al., 2017, 2021, Jang et al., 2013, van Ackeren et al., 2012, see Hagoort and Levinson, 2014, for a review] —
in other words, in situations where understanding the meaning of an utterance requires inferring the intentions of the
speaker. Thus, successful language understanding and use relies on our broader, non-language-specific social inference
skills.
LLMs. Recent versions of OpenAI’s GPT-3 models show a markedly improved capacity to interpret non-literal
utterances, such as metaphors and polite deceit, suggesting that they can reach human or near-human performance on at
least some pragmatic tasks [Hu et al., 2022a]. That said, today’s LLMs are still unable to interpret sarcasm or complete
jokes [Hu et al., 2022a]; what is perhaps even more important, they struggle on theory of mind tasks, which require
inferring the intentions behind others’ actions [Sap et al., 2022]. Thus, in agreement with neural evidence, linguistic
and social/pragmatic skills dissociate in LLMs.
Moreover, LLMs themselves lack communicative intent [Shanahan, 2022, Bender et al., 2021]. The closest they come
to intentionality is modeling a document-specific distribution of language patterns, which can result in generated strings
that are overall consistent with a particular person/agent [Andreas, 2022], but the intent behind these strings is still
missing. Globally speaking, these models have nothing to say. Nor should we expect them to: LLMs’ training objective
is maximizing next-/masked-word predictive accuracy, not generating utterances that allow them to achieve specific
goals in the world.
A consequence of the lack of communicative intent is that models’ attempts to automatically generate long stretches of
text often eventually degrade. Consider the example in Section 3.2, in which we asked GPT-3 to complete a prompt. It
was able to generate a mostly plausible continuation for a sentence or two. But, when prompted to continue further after the same prompt, it started saying things that were false or misleading (e.g., "GPT-3 is a recurrent neural network (RNN).") and, after enough tokens, started to ramble. For instance, it gave a caption for a non-existent figure:
Figure 2: Sample outputs from GPT-3. Top left : “It was the best of times, it was the worst of
times.” Top right : “The moon is made of green cheese.” Bottom left : “The man saw a dog going
into a house and went to the store for some milk.” Bottom right: “I am here today because my
father was not there yesterday.”
Although all of these sentences are grammatical (and even follow a sensible scheme for a four-panel figure in an academic paper), GPT-3 has no intention, no broader meaning to communicate, and so, at some point sufficiently removed from the human-generated prompt, its output becomes incoherent.
Further, even if given explicit instructions, LLMs can be easily distracted, as demonstrated by the example in Figure
3. Attempts to align the model’s output with the user’s intent often require adding an objective other than language
modeling [e.g., Ouyang et al., 2022, InstructGPT], and even those are imperfect. Overall, LLMs’ inability to infer and
maintain the goals of the interaction means that their outputs will often be meaningless and/or mis-specified despite
high linguistic well-formedness.

Figure 3: GPT-3 is unable to infer and maintain the intent underlying an interaction, making it vulnerable to so-called "prompt injection" attacks. Example by Riley Goodside: https://threadreaderapp.com/thread/1569128808308957185.html.

4.4 Interim conclusions

Real-life language use requires integrating language into a broader cognitive framework. In this section, we have shown
that many capacities required for language comprehension and production are, in fact, not specific to language and are
supported by distinct brain circuits. In line with this distinction, models that master many syntactic and distributional properties of human language still cannot use language in human-like ways. In particular, they struggle when engaging
in formal reasoning, fail to acquire comprehensive and consistent world knowledge, cannot track objects, relations and
events in long inputs, and are unable to generate utterances intentionally or infer communicative intent from linguistic
input. In other words, their functional language competence remains in its infancy.
This is not to say that LLMs can only ever master formal linguistic competence. Some non-linguistic abilities that
contemporary LLMs succeed at include various forms of general pattern completion (“a, ab, abc, ?”), style transfer, and
long- and short-term memory. Nevertheless, their failure to master the four functional competence domains described
in this section is quite notable: in line with evidence from cognitive neuroscience, LLMs’ behavior highlights the
difference between being good at language and being good at thought.
The stark dissociation between formal and functional language competence in both humans and contemporary LLMs
raises a question: is it reasonable to model these diverse capabilities using a single system and a single objective
function? We turn to this question next.

5 Building models that talk and think like humans

The distinction between formal and functional competence has several important implications for building better models
of real-life language use. Here, we discuss three ingredients required to build models that talk and think like humans:
modularity, curated data combined with diverse objective functions, and separate benchmarks for formal and functional
competence.

5.1 Modularity

In this paper, we have advanced the thesis that functional competence and formal linguistic competence are distinct capa-
bilities, recruiting different machinery in the human brain. More broadly, most biological intelligent systems—including
both human and non-human minds—are highly modular [e.g., Carruthers, 2002, 2005, Cosmides and Tooby, 1994,
Fedorenko et al., 2011, Kanwisher et al., 1997, Meunier et al., 2010]. What can this modularity tell us about how to
build better, more human-like models?
We argue that future language models can master both formal and functional linguistic competence by establishing
a division of labor between the core language system and components for other cognitive processes, such as formal
logic and social reasoning. We see at least two ways to implement this division of labor: explicitly building modularity
into the architecture of the system (we call this Architectural Modularity) or naturally inducing modularity through the
training process, both through the training data and the objective function (we call this Emergent Modularity).
Architectural Modularity has a long history; it involves stitching together separate components, perhaps with quite
specialized architectures [e.g., Bottou and Gallinari, 1990, Ronco and Gawthrop, 1997]. More recent examples include
a transformer language model paired with a separate memory module [e.g., Borgeaud et al., 2022, d’Autume et al.,
2019, Liu et al., 2022a] or a model for visual question answering, which includes a language module, a vision module,
and a reasoning module [e.g., Yi et al., 2018, Mao et al., 2019, Andreas et al., 2016, Hudson and Manning, 2019,
Johnson et al., 2017]. Such modular models are capable of achieving high task performance, are more efficient (i.e.,
can be trained on smaller datasets and have lower memory demands), and show high generalizability (i.e., perform
well on datasets with previously unseen properties). The modules of such models can be trained separately or together,
similarly to how humans can flexibly combine different cognitive skills when learning to perform novel complex tasks.
The Emergent Modularity approach involves training models end-to-end (similarly to contemporary LLMs) but allows
modularity to develop naturally within the model. Modular structure has been shown to spontaneously emerge in
some end-to-end neural network systems in domains other than language [e.g., Yang et al., 2019, Dobs et al., 2022],
suggesting that modularity may constitute an optimal solution to many complex tasks. For this approach to be successful,
the model architecture must allow individual, specialized modules to develop within the model. Transformers, the
most popular architecture today, satisfy this condition to some extent by allowing different attention heads to attend to
different input features [e.g. Manning et al., 2020, Vaswani et al., 2017b, Vig and Belinkov, 2019]; certain approaches
promote modularization even more explicitly, e.g., by endowing transformers with a mixture-of-experts architecture
[Goyal et al., 2022, Kudugunta et al., 2021, Zhou et al., 2022].
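
As a concrete and deliberately minimal illustration of the "separate experts plus a gate" idea behind these approaches, the sketch below implements a densely routed mixture-of-experts feed-forward layer in PyTorch; production systems [e.g., Kudugunta et al., 2021] add sparse top-k routing, load balancing, and much more.

```python
# Minimal mixture-of-experts feed-forward layer with dense (soft) routing, assuming
# PyTorch. Real MoE transformers use sparse top-k routing and load-balancing losses;
# this sketch only illustrates the basic division of labor among experts.
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):  # x: (batch, seq, d_model)
        weights = torch.softmax(self.gate(x), dim=-1)                      # (batch, seq, n_experts)
        outputs = torch.stack([expert(x) for expert in self.experts], -1)  # (batch, seq, d_model, n_experts)
        return (outputs * weights.unsqueeze(-2)).sum(-1)                   # weighted mix of experts

layer = MixtureOfExperts(d_model=64, d_hidden=256, n_experts=4)
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

Sparse variants route each token to only the top-scoring expert(s), which is what makes large mixture-of-experts models computationally affordable and encourages the experts to specialize.
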
A modular language model architecture is much better aligned with the fact that real-life language use is a complex
capability, requiring both language-specific knowledge (formal competence) and various non-language-specific cognitive
abilities (functional competence). Whether built-in or induced to emerge, modularity can lead the models to mirror the
functional organization of the human brain and, consequently, make their behavior much more humanlike.

5.2 Curated data and diverse objective functions

We argue that the approach that has dominated the field for the last five years—training LLMs on large “naturalistic”
text corpora from the web with a words-in-context prediction objective—is insufficient to induce the emergence of
functional linguistic competence. First, this approach is biased toward low-level input properties, leading to unstable
model behavior that depends on the particular way a prompt is phrased. Second, information contained in regular
text corpora does not faithfully reflect the world: for instance, it is biased toward unusual events and contains little
commonsense knowledge. Third, and perhaps most crucially, it incentivizes the models to learn patterns in the text
(at various levels of abstraction) but limits their ability to generalize out-of-distribution (see Section 4.3 for example
failures caused by all these issues). Finally, even in cases where LLMs succeed, the amount of naturalistic data required
for non-linguistic capacities to emerge is ridiculously large [Wei et al., 2022a], making this approach vastly inefficient
(and environmentally irresponsible).
Today, we already see examples where adjusting the training data and/or the objective function yields improved results.
One such example is the math-focused model Minerva [Lewkowycz et al., 2022]. The system is built on the PaLM language
model, a transformer LLM similar to the ones we discussed here. But what makes Minerva successful on math problems
is that it was fine-tuned on a specialized math corpus, with special processing to make sure the mathematical notation is
machine-readable (with some additional tricks, such as chain-of-thought prompting; Wei et al., 2022b). Examples of
LLMs that benefit from an additional objective function are InstructGPT [Ouyang et al., 2022] and ChatGPT5 , models
that build upon a large GPT-style LLM but are additionally trained using human feedback. In particular, they use
reinforcement learning to increase the likelihood of generating answers that a human might label as ‘good’, leading to
outputs that humans indeed consider to be high-quality, at least at first glance.
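
A sketch of one core ingredient of such human-feedback training, the pairwise preference loss used to fit a reward model from human comparisons [Ouyang et al., 2022], is shown below; the reward model is reduced to a toy placeholder, since in practice it is itself a large language model with a scalar output head.

```python
# Sketch of the pairwise preference loss used to train a reward model from human
# comparisons in InstructGPT-style pipelines [Ouyang et al., 2022]. The reward model
# here is a toy placeholder; in practice it is an LLM with a scalar output head.
import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    def __init__(self, d_in: int = 32):
        super().__init__()
        self.score = nn.Linear(d_in, 1)  # stands in for "LLM + scalar head"

    def forward(self, response_features: torch.Tensor) -> torch.Tensor:
        return self.score(response_features).squeeze(-1)

def preference_loss(reward_model, preferred, rejected):
    # -log sigmoid(r_preferred - r_rejected): push preferred responses above rejected ones
    return -torch.nn.functional.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()

reward_model = ToyRewardModel()
preferred = torch.randn(8, 32)  # toy features of human-preferred responses
rejected = torch.randn(8, 32)   # toy features of dispreferred responses
loss = preference_loss(reward_model, preferred, rejected)
loss.backward()  # the fitted reward model then guides reinforcement-learning fine-tuning
print(float(loss))
```
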
We believe that a model that succeeds at real-world language use would include, in addition to the core language component, a successful problem solver, a grounded experiencer, a situation modeler, a pragmatic reasoner, and a goal
setter. In a way, we therefore arrive at the same conclusion as Turing [1950]: a model that masters language use, not just
the rules and patterns of natural language, has to be a general intelligence model. Furthermore, based on the functional
organization of the brain, the machinery required to simulate intelligence will include both domain-general components
(such as the multiple demand network [Duncan, 2010, Fedorenko et al., 2013]) and domain-specific components (such
as brain regions specialized for intuitive physics [Fischer et al., 2016], navigation [Epstein et al., 2017], and social
reasoning [Section 4.3.4]). This modularity could be baked in by training modular models (see Section 5.1) on a mixture
of carefully curated datasets using diverse objective functions (e.g., the way ChatGPT combines a pure language modeling
objective with an additional human feedback objective).

5.3 Separate benchmarks for formal and functional competence

To assess progress on the road toward building models that use language in humanlike ways, it is important to develop
benchmarks that evaluate both formal and functional linguistic competence. This distinction can reduce the confusion
that arises when discussing these models by combating the “good at language -> good at thought” and the “bad at
thought -> bad at language” fallacies.
Several existing benchmarks already evaluate formal linguistic competence in LLMs [e.g., Gauthier et al., 2020,
Warstadt et al., 2020] and can be complemented by additional tests of core linguistic features: hierarchy and abstraction
(Section 3.3). At present, no single benchmark for evaluating functional linguistic competence exists, and datasets that
target its subsets, like commonsense reasoning (e.g., WinoGrande from Sakaguchi et al., 2019; HellaSwag from Zellers
et al., 2019), can often be "hacked" by LLMs that leverage flawed heuristics [Elazar et al., 2021b]. This issue is likely
exacerbated in large-scale heterogeneous datasets like BIG-bench [Srivastava et al., 2022]. However, it is certainly
possible to disentangle word-co-occurrence-based hacks and true reasoning capabilities, as evidenced, for instance, by
the example in Section 4.2.
Overall, we believe that developing comprehensive, separate assessments of formal linguistic competence and different
aspects of functional linguistic competence in language models will enable the field to develop models that excel at
both. Eventually, models should be able to solve complex tasks that require all aspects of linguistic competence, but
at this, still relatively early stage of building functionally competent models, it is important to target particular skills
known to be separable in humans, so as to be able to interpret the models’ failures.

5
https://openai.com/blog/chatgpt/

6 General Conclusion
The discourse around the latest crop of language models has consisted of a curious mix of overclaiming and under-
claiming [Bowman, 2022]. On the one hand, hyperbolic and fantastical articles in the press have claimed that models
like GPT-3 have solved language and will make human writers redundant. On the other hand, a steady stream of articles
within the academic literature have pointed out the many failures of LLMs: for instance, they get tripped up on tasks
that require abstract reasoning and start to meander when the discourse gets too long.
Here, we have put these seemingly inconsistent reactions in dialogue with prior and ongoing work in computational
linguistics, cognitive science, and neuroscience. In particular, we argue that LLMs are remarkably successful on tasks
that require a particular type of structural and statistical linguistic competence—formal linguistic competence. Although
their performance is not yet fully human-like, these models achieve an impressive degree of success in representing and
using hierarchical relationships among words and building representations that are sufficiently abstract to generalize to
new words and constructions. As such, these LLMs are underused in linguistics and cognitive science as candidate
models of human language processing [see also Baroni, 2021, Linzen, 2019, Linzen and Baroni, 2021, Pater, 2019,
Potts, 2020, Warstadt and Bowman, 2022]. We also review some of the LLMs’ failures on tasks that reflect real-life
language use, such as reasoning, while highlighting that the capabilities these tasks require are fundamentally distinct
from formal language competence and rely on specialized machinery in the human brain. In line with Turing [1950],
we conclude that a model that excels at real-life language use would need to be an AGI, and argue that a human-like
AGI cannot be reached simply by getting really good at predicting upcoming words.
The many failures of LLMs on non-linguistic tasks do not undermine them as good models of language processing.
After all, the set of areas that support language processing in the human brain also cannot do math, solve logical
problems, or even track the meaning of a story across multiple paragraphs. If we take the human mind and brain—a
good example of generalized intelligence—as a guide, we might expect that future advances in AGI will depend on
combining language models with models that represent abstract knowledge and support complex reasoning, rather than
expecting a single model (trained with a single word prediction objective) to do it all.
To those who have argued that most interesting aspects of human language cannot be learned from data alone, we
say that LLMs compellingly demonstrate the possibility of learning complex syntactic features from linguistic input
(even if, as of now, much more input is required than a typical child sees). To those who criticize LLMs for their
inability to do complex arithmetic or to fully reason about the world, we say, give language models a break: given a
strict separation of language and non-linguistic capabilities in the human mind, we should evaluate these capabilities
separately, recognizing success in formal linguistic competence even when non-linguistic capabilities lag behind.
Finally, to those who are looking to language models as a route to AGI, we suggest that, instead of or in addition to
scaling up the size of the models [Kaplan et al., 2020], more promising solutions will come in the form of modular
architectures—pre-specified or emergent—that, like the human brain, integrate language processing with additional
systems that carry out perception, reasoning, and planning.

Acknowledgements
For helpful conversations, we thank Jacob Andreas, Alex Warstadt, Dan Roberts, the attendees of the Harvard LangCog
journal club, and the attendees of the UT Austin Department of Linguistics SynSem seminar.

Funding Sources
AI was supported by funds from the Quest Initiative for Intelligence. EF was supported by NIH awards R01-DC016607,
R01-DC016950, and U01-NS121471 and by research funds from the Brain and Cognitive Sciences Department,
McGovern Institute for Brain Research, and the Simons Foundation through the Simons Center for the Social Brain.
KM acknowledges funding from NSF Grant 2104995.

Conflicts of Interest
The authors declare no Conflicts of Interest.

References
Mostafa Abdou, Artur Kulmizev, Daniel Hershcovich, Stella Frank, Ellie Pavlick, and Anders Søgaard. Can language
models encode perceptual structure without grounding? A case study in color. arXiv preprint arXiv:2109.06129, 2021.
David Adger. Core Syntax: A Minimalist Approach. Oxford University Press Oxford, 2003.
Ralph Adolphs. The Human Amygdala and Emotion. The Neuroscientist, 5(2):125–137, March 1999. ISSN 1073-8584.
doi: 10.1177/107385849900500216. URL https://doi.org/10.1177/107385849900500216. Publisher: SAGE
Publications Inc STM.
Ralph Adolphs. The Social Brain: Neural Basis of Social Knowledge. Annual review of psychology, 60:693–716, 2009.
ISSN 0066-4308. doi: 10.1146/annurev.psych.60.110707.163514. URL https://www.ncbi.nlm.nih.gov/pmc/
articles/PMC2588649/.
G. Altmann and Y. Kamide. Incremental interpretation at verbs: Restricting the domain of subsequent reference.
Cognition, 73(3):247–264, 1999.
G. Altmann and M. Steedman. Interaction with context during human sentence processing* 1. Cognition, 30(3):
191–238, 1988.
Gerry TM Altmann and Yuki Kamide. The real-time mediation of visual attention by language and world knowledge:
Linking anticipatory (and other) eye movements to linguistic processing. Journal of memory and language, 57(4):
502–518, 2007. Publisher: Elsevier.
Marie Amalric and Stanislas Dehaene. Origins of the brain networks for advanced mathematics in expert mathematicians.
Proceedings of the National Academy of Sciences of the United States of America, 113(18):4909–4917, May 2016.
ISSN 1091-6490. doi: 10.1073/pnas.1603205113.
Marie Amalric and Stanislas Dehaene. A distinct cortical network for mathematical knowledge in the human brain.
NeuroImage, 189:19–31, April 2019. ISSN 10538119. doi: 10.1016/j.neuroimage.2019.01.001. URL https:
//linkinghub.elsevier.com/retrieve/pii/S1053811919300011.
Ben Ambridge. Against stored abstractions: A radical exemplar model of language acquisition. First Language, 40
(5-6):509–559, 2020.
Jacob Andreas. Language Models as Agent Models, December 2022. URL http://arxiv.org/abs/2212.01681.
arXiv:2212.01681 [cs].
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question
answering. arXiv preprint arXiv:1601.01705, 2016.
Clara Andrés-Roqueta and Napoleon Katsos. The Contribution of Grammar, Vocabulary and Theory of Mind in
Pragmatic Language Competence in Children with Autistic Spectrum Disorders. Frontiers in Psychology, 8, 2017.
ISSN 1664-1078. URL https://www.frontiersin.org/articles/10.3389/fpsyg.2017.00996.
Suhas Arehalli, Brian Dillon, and Tal Linzen. Syntactic surprisal from neural models predicts, but underestimates,
human processing difficulty from syntactic ambiguities. arXiv preprint arXiv:2210.12187, 2022.
R.N. Aslin. What’s in a look? Developmental Science, 10(1):48–53, 2007.
R.N. Aslin, J.R. Saffran, and E.L. Newport. Computation of Conditional Probability Statistics by 8-Month-Old Infants.
Psychological Science, 9(4):321–324, 1998.
Christopher Baldassano, Janice Chen, Asieh Zadbood, Jonathan W. Pillow, Uri Hasson, and Kenneth A. Norman.
Discovering Event Structure in Continuous Narrative Perception and Memory. Neuron, 95(3):709–721.e5, August
2017. ISSN 1097-4199. doi: 10.1016/j.neuron.2017.06.041.
Christopher Baldassano, Uri Hasson, and Kenneth A. Norman. Representation of Real-World Event Schemas during
Narrative Perception. Journal of Neuroscience, 38(45):9689–9699, November 2018. ISSN 0270-6474, 1529-2401.
doi: 10.1523/JNEUROSCI.0251-18.2018. URL https://www.jneurosci.org/content/38/45/9689.
Juliana V. Baldo, Nina F. Dronkers, David Wilkins, Carl Ludy, Patricia Raskin, and Jiye Kim. Is problem solving
dependent on language? Brain and Language, 92:240–250, 2005. ISSN 1090-2155. doi: 10.1016/j.bandl.2004.06.103.
Place: Netherlands Publisher: Elsevier Science.
Juliana V. Baldo, Silvia A. Bunge, Stephen M. Wilson, and Nina F. Dronkers. Is relational reasoning dependent on
language? A voxel-based lesion symptom mapping study. Brain and Language, 113(2):59–64, May 2010. ISSN
1090-2155. doi: 10.1016/j.bandl.2010.01.004.
Marco Baroni. On the proper role of linguistically-oriented deep net analysis in linguistic theorizing. arXiv:2106.08694
[cs], June 2021. URL http://arxiv.org/abs/2106.08694. arXiv: 2106.08694.
Marco Baroni and Alessandro Lenci. Distributional memory: A general framework for corpus-based semantics.
Computational Linguistics, 36(4):673–721, 2010.

A Basso and E Capitani. Spared musical abilities in a conductor with global aphasia and ideomotor apraxia. Journal of
Neurology, Neurosurgery, and Psychiatry, 48(5):407–412, May 1985. ISSN 0022-3050. URL https://www.ncbi.
nlm.nih.gov/pmc/articles/PMC1028326/.
Elizabeth Bates and Brian MacWhinney. Competition, variation, and language learning. Mechanisms of language
acquisition, pages 157–193, 1987.
Elizabeth Bates and Brian MacWhinney. Functionalism and the competition model. The crosslinguistic study of
sentence processing, 3:73–112, 1989a.
Elizabeth Bates and Brian MacWhinney. Functionalism and the competition model. The crosslinguistic study of
sentence processing, 3:73–112, 1989b.
Elizabeth Bates, Stephen M. Wilson, Ayse Pinar Saygin, Frederic Dick, Martin I. Sereno, Robert T. Knight, and Nina F.
Dronkers. Voxel-based lesion-symptom mapping. Nature Neuroscience, 6(5):448–450, May 2003. ISSN 1097-6256.
doi: 10.1038/nn1050.
Alexa Bautista and Stephen M. Wilson. Neural responses to grammatically and lexically degraded speech. Lan-
guage, Cognition and Neuroscience, 31(4):567–574, April 2016. ISSN 2327-3798. doi: 10.1080/23273798.
2015.1123281. URL https://doi.org/10.1080/23273798.2015.1123281. Publisher: Routledge _eprint:
https://doi.org/10.1080/23273798.2015.1123281.
Judith Bek, Mark Blades, Michael Siegal, and Rosemary A. Varley. Language and spatial reorientation: evidence from
severe aphasia. Journal of Experimental Psychology. Learning, Memory, and Cognition, 36(3):646–658, May 2010.
ISSN 1939-1285. doi: 10.1037/a0018281.
Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):
207–219, March 2022. doi: 10.1162/coli_a_00422. URL https://aclanthology.org/2022.cl-1.7.
Yonatan Belinkov and Yonatan Bisk. Synthetic and natural noise both break neural machine translation. arXiv preprint
arXiv:1711.02173, 2017.
Yonatan Belinkov and James Glass. Analysis methods in neural language processing: A survey. Transactions
of the Association for Computational Linguistics, 7:49–72, 2019. doi: 10.1162/tacl_a_00254. URL https:
//aclanthology.org/Q19-1004.
Yonatan Belinkov, Sebastian Gehrmann, and Ellie Pavlick. Interpretability and analysis in neural NLP. In Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 1–5,
Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-tutorials.1. URL https:
//aclanthology.org/2020.acl-tutorials.1.
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint
arXiv:2004.05150, 2020.
Emily M. Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age
of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–
5198, Online, July 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.463. URL
https://aclanthology.org/2020.acl-main.463.
Emily M. Bender and Alexander Koller. Climbing towards NLU: On Meaning, Form, and Understanding in the
Age of Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages
5185–5198, Online, July 2020b. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.463.
URL https://www.aclweb.org/anthology/2020.acl-main.463.
Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic
parrots: Can language models be too big? Proceedings of FAccT, 2021. URL https://dl.acm.org/doi/pdf/10.
1145/3442188.3445922.
Yael Benn, Anna A. Ivanova, Oliver Clark, Zachary Mineroff, Chloe Seikus, Jack Santos Silva, Rosemary Varley,
and Evelina Fedorenko. No evidence for a special role of language in feature-based categorization. bioRxiv, page
2021.03.18.436075, March 2021. doi: 10.1101/2021.03.18.436075. URL https://www.biorxiv.org/content/
10.1101/2021.03.18.436075v1. Publisher: Cold Spring Harbor Laboratory Section: New Results.
Jean-Philippe Bernardy and Shalom Lappin. Using deep neural networks to learn syntactic agreement. LiLT (Linguistic
Issues in Language Technology), 15, 2017.
Robert C Berwick, Paul Pietroski, Beracah Yankama, and Noam Chomsky. Poverty of the stimulus revisited. Cognitive
science, 35(7):1207–1242, 2011a.
Robert C Berwick, Paul Pietroski, Beracah Yankama, and Noam Chomsky. Poverty of the stimulus revisited. Cognitive
science, 35(7):1207–1242, 2011b.

Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki
Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, and Joseph Turian. Experience grounds language,
2020.
Idan A. Blank and Evelina Fedorenko. Domain-General Brain Regions Do Not Track Linguistic Input as Closely
as Language-Selective Regions. Journal of Neuroscience, 37(41):9999–10011, October 2017. ISSN 0270-6474,
1529-2401. doi: 10.1523/JNEUROSCI.3642-16.2017. URL https://www.jneurosci.org/content/37/41/9999.
Idan A. Blank and Evelina Fedorenko. No evidence for differences among language regions in their temporal receptive
windows. NeuroImage, page 116925, May 2020. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2020.116925. URL
http://www.sciencedirect.com/science/article/pii/S1053811920304110.
Idan A. Blank, Nancy Kanwisher, and Evelina Fedorenko. A functional dissociation between language and multiple-
demand systems revealed in patterns of BOLD signal fluctuations. Journal of Neurophysiology, 112(5):1105–1118,
September 2014. ISSN 1522-1598. doi: 10.1152/jn.00884.2013.
Idan A. Blank, Zuzanna Balewski, Kyle Mahowald, and Evelina Fedorenko. Syntactic processing is distributed across
the language system. NeuroImage, 127:307–323, February 2016. ISSN 10538119. doi: 10.1016/j.neuroimage.2015.
11.069. URL https://linkinghub.elsevier.com/retrieve/pii/S1053811915011064.
Damian Blasi, Antonios Anastasopoulos, and Graham Neubig. Systematic inequalities in language technology
performance across the world’s languages. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 5486–5505, Dublin, Ireland, May 2022. Association
for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.376. URL https://aclanthology.org/2022.
acl-long.376.
Paul Bloom. How children learn the meanings of words. MIT press, 2002.
Kathryn Bock and Carol A. Miller. Broken agreement. Cognitive psychology, 23(1):45–93, 1991. Publisher: Elsevier.
R. Bod. From exemplar to grammar: A probabilistic analogy-based model of language learning. Cognitive Science, 33
(5):752–793, 2009.
Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer
programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing
systems, 29, 2016.
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein,
Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv
preprint arXiv:2108.07258, 2021.
Dan Boneh, Andrew J Grotto, Patrick McDaniel, and Nicolas Papernot. How relevant is the turing test in the age of
sophisbots? IEEE Security & Privacy, 17(6):64–71, 2019.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den
Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick,
Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela
Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and
Laurent Sifre. Improving language models by retrieving from trillions of tokens, February 2022. URL http:
//arxiv.org/abs/2112.04426. arXiv:2112.04426 [cs].
Léon Bottou and Patrick Gallinari. A framework for the cooperation of learning algorithms. Advances in neural
information processing systems, 3, 1990.
Samuel Bowman. The dangers of underclaiming: Reasons for caution when reporting how NLP systems fail. In
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pages 7484–7499, 2022.
Rodrigo M. Braga, Koene R. A. Van Dijk, Jonathan R. Polimeni, Mark C. Eldaief, and Randy L. Buckner. Par-
allel distributed networks resolved at high resolution reveal close juxtaposition of distinct regions. Journal of
Neurophysiology, 121(4):1513–1534, 2019. ISSN 1522-1598. doi: 10.1152/jn.00808.2018.
Laurel Brehm and Kathryn Bock. What counts in grammatical number agreement? Cognition, 128(2):149–169, 2013.
Jonathan R. Brennan, Chris Dyer, Adhiguna Kuncoro, and John T. Hale. Localizing syntactic predictions us-
ing recurrent neural network grammars. Neuropsychologia, 146:107479, September 2020. ISSN 0028-3932.
doi: 10.1016/j.neuropsychologia.2020.107479. URL https://www.sciencedirect.com/science/article/pii/
S0028393220301500.
Joan Bresnan, editor. The mental representation of grammatical relations, volume 1. MIT Press, Cambridge, MA, 1982.

Joan Bresnan. Is syntactic knowledge probabilistic? Experiments with the English dative alternation. Roots: Linguistics
in search of its evidential base, 96:77–96, 2007.
Paul Broca. Sur le siège de la faculté du langage articulé (15 juin). Bulletins de la Société Anthropologique de Paris, 6:
377–393, 1865.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners. In
Advances in Neural Information Processing Systems, 2020. URL https://papers.nips.cc/paper/2020/file/
1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Mary Bucholtz and Kira Hall. Language and identity. A companion to linguistic anthropology, 1:369–394, 2004.
Mary Bucholtz and Kira Hall. Identity and interaction: A sociocultural linguistic approach. Discourse studies, 7(4-5):
585–614, 2005. Publisher: SAGE Publications London, Thousand Oaks, CA and New Delhi.
Randy L. Buckner and Lauren M. DiNicola. The brain’s default network: Updated anatomy, physiology and evolving
insights. Nature Reviews Neuroscience, 20(10):593–608, 2019. ISSN 1471-0048. doi: 10.1038/s41583-019-0212-7.
Place: United Kingdom Publisher: Nature Publishing Group.
Randy L. Buckner, Jessica R. Andrews-Hanna, and Daniel L. Schacter. The Brain’s Default Network. Annals of
the New York Academy of Sciences, 1124(1):1–38, 2008. ISSN 1749-6632. doi: 10.1196/annals.1440.011. URL
https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1196/annals.1440.011.
J.L. Bybee and P. Hopper. Frequency and the emergence of linguistic structure, volume 45. John Benjamins Publishing
Company, 2001.
Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora
contain human-like biases. Science, 356(6334):183–186, 2017.
Peter Carruthers. The cognitive functions of language. The Behavioral and Brain Sciences, 25(6):657–674; discussion
674–725, December 2002. ISSN 0140-525X. doi: 10.1017/s0140525x02000122.
Peter Carruthers. The case for massively modular models of mind. Contemporary debates in cognitive science, ed. R.
Stainton, pages 205–25, 2005. Publisher: Citeseer.
Charlotte Caucheteux and Jean-Rémi King. Brains and algorithms partially converge in natural language processing.
Communications Biology, 5(1):1–10, February 2022. ISSN 2399-3642. doi: 10.1038/s42003-022-03036-1. URL
https://www.nature.com/articles/s42003-022-03036-1. Number: 1 Publisher: Nature Publishing Group.
Michael Cerullo. In Defense of Blake Lemoine and the Possibility of Machine Sentience in LaMDA, 2022.
David Chalmers. Are large language models sentient?, 2022.
N. Chater, J.B. Tenenbaum, and A. Yuille. Probabilistic models of cognition: Conceptual foundations. Trends in
Cognitive Sciences, 10(7):287–291, 2006.
Rui P. Chaves. What Don’t RNN Language Models Learn About Filler-Gap Dependencies? In Proceedings of the third
meeting of the Society for Computation in Linguistics (SCiL), 2020.
Rui P Chaves and Stephanie N Richter. Look at that! BERT can be easily distracted from paying attention to morphosyntax.
Proceedings of the Society for Computation in Linguistics, 4(1):28–38, 2021.
Xuanyi Chen, Josef Affourtit, Rachel Ryskin, Tamar I. Regev, Samuel Norman-Haignere, Olessia Jouravlev, Saima
Malik-Moraleda, Hope Kean, Rosemary Varley, and Evelina Fedorenko. The human language system does not sup-
port music processing, June 2021. URL https://www.biorxiv.org/content/10.1101/2021.06.01.446439v1.
Pages: 2021.06.01.446439 Section: New Results.
Emmanuele Chersoni, Enrico Santus, Chu-Ren Huang, and Alessandro Lenci. Decoding word embeddings with brain-
based semantic features. Computational Linguistics, 47(3):663–698, November 2021. doi: 10.1162/coli_a_00412.
URL https://aclanthology.org/2021.cl-3.20.
N. Chomsky. Syntactic Structures. The Hague: Mouton, 1957.
N. Chomsky. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA, 1965.
Noam Chomsky. Linguistics and cognitive science: problems and mysteries. In The Chomskyan Turn. Blackwell,
Oxford, UK, 1991.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and others. PaLM: Scaling language modeling with
pathways. arXiv preprint arXiv:2204.02311, 2022.
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning
from human preferences. Advances in neural information processing systems, 30, 2017.
Morten H Christiansen and Nick Chater. Creating language: Integrating evolution, acquisition, and processing. MIT
Press, 2016.
Kiel Christianson. When language comprehension goes wrong for the right reasons: Good-enough, underspecified, or
shallow language processing. Quarterly journal of experimental psychology, 69(5):817–828, 2016.
Alexander Clark. Distributional Learning as a Theory of Language Acquisition. In Proceedings of the 5th Workshop
on Cognitive Aspects of Computational Language Learning (CogACLL), page 29, Gothenburg, Sweden, April
2014. Association for Computational Linguistics. doi: 10.3115/v1/W14-0506. URL https://aclanthology.org/
W14-0506.
Herbert H Clark. Arenas of language use. University of Chicago Press, 1992.
Herbert H Clark. Using language. Cambridge university press, 1996.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. Electra: Pre-training text encoders
as discriminators rather than generators. In International Conference on Learning Representations, 2020. URL
https://openreview.net/forum?id=r1xMH1BtvB.
John P. Coetzee and Martin M. Monti. At the core of reasoning: Dissociating deductive and non-deductive load. Human
Brain Mapping, 39(4):1850–1861, 2018. ISSN 1097-0193. doi: 10.1002/hbm.23979.
Katherine M. Collins, Catherine Wong, Jiahai Feng, Megan Wei, and Joshua B. Tenenbaum. Structured, flexible, and
robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution
reasoning tasks, May 2022. URL http://arxiv.org/abs/2205.05718. arXiv:2205.05718 [cs].
Bernard Comrie. Language universals and linguistic typology: Syntax and morphology. University of Chicago press,
1989.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised Learning of Universal
Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing, pages 670–680. Association for Computational Linguistics,
2017. doi: 10.18653/v1/D17-1070. URL http://aclweb.org/anthology/D17-1070. event-place: Copenhagen,
Denmark.
Leda Cosmides and John Tooby. Origins of domain specificity: The evolution of functional organization. Mapping the
mind: Domain specificity in cognition and culture, pages 85–116, 1994.
Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. Are all languages equally hard to language-model? In
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 2 (Short Papers), pages 536–541, New Orleans, Louisiana, June 2018.
Association for Computational Linguistics. doi: 10.18653/v1/N18-2085. URL https://aclanthology.org/
N18-2085.
Robert Dale. GPT-3: What's it good for? Natural Language Engineering, 27(1):113–118, 2021.
Mary Dalrymple and Tracy Holloway King. An amazing four doctoral dissertations. Argumentum, 15(2019), 2019.
Publisher: Debreceni Egyetemi Kiado.
Antonio R. Damasio. Aphasia. New England Journal of Medicine, 326(8):531–539, February 1992. ISSN 0028-4793.
doi: 10.1056/NEJM199202203260806. URL https://doi.org/10.1056/NEJM199202203260806. Publisher:
Massachusetts Medical Society.
Cyprien de Masson d’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. Episodic Memory in Lifelong
Language Learning. In NeurIPS. arXiv, November 2019. doi: 10.48550/arXiv.1906.01076. URL http://arxiv.
org/abs/1906.01076. arXiv:1906.01076 [cs, stat].
Ben Deen and Winrich A. Freiwald. Parallel systems for social and spatial reasoning within the cortical apex, April
2022. URL https://www.biorxiv.org/content/10.1101/2021.09.23.461550v3. Pages: 2021.09.23.461550
Section: New Results.
Ben Deen, Kami Koldewyn, Nancy Kanwisher, and Rebecca Saxe. Functional Organization of Social Perception and
Cognition in the Superior Temporal Sulcus. Cerebral Cortex, 25(11):4596–4609, November 2015. ISSN 1047-3211.
doi: 10.1093/cercor/bhv111. URL https://doi.org/10.1093/cercor/bhv111.

Fatma Deniz, Anwar O. Nunez-Elizalde, Alexander G. Huth, and Jack L. Gallant. The Representation of Semantic
Information Across Human Cerebral Cortex During Listening Versus Reading Is Invariant to Stimulus Modality.
Journal of Neuroscience, 39(39):7722–7736, September 2019. ISSN 0270-6474, 1529-2401. doi: 10.1523/
JNEUROSCI.0675-19.2019. URL https://www.jneurosci.org/content/39/39/7722. Publisher: Society for
Neuroscience Section: Research Articles.
Daniel C Dennett. The role of language in intelligence. In What is Intelligence? The Darwin College Lectures, ed. Jean
Khalfa, Cambridge University Press, Cambridge, UK, 1994.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pages 4171–4186, 2019. URL https://www.aclweb.org/anthology/N19-1423.
Evgeniia Diachek, Idan Blank, Matthew Siegelman, Josef Affourtit, and Evelina Fedorenko. The Domain-General
Multiple Demand (MD) Network Does Not Support Core Aspects of Language Comprehension: A Large-Scale
fMRI Investigation. Journal of Neuroscience, 40(23):4536–4550, June 2020. ISSN 0270-6474, 1529-2401. doi: 10.
1523/JNEUROSCI.2036-19.2020. URL https://www.jneurosci.org/content/40/23/4536. Publisher: Society
for Neuroscience Section: Research Articles.
Frederic Dick, Elizabeth Bates, Beverly Wulfeck, Jennifer Utman, Nina Dronkers, and Morton Ann Gernsbacher.
Language deficits, localization, and grammar: Evidence for a distributive model of language breakdown in aphasic
patients and neurologically intact individuals. Psychological review, 108(4):759–788, October 2001. ISSN 0033-295X.
URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4301444/.
Nai Ding, Lucia Melloni, Hang Zhang, Xing Tian, and David Poeppel. Cortical tracking of hierarchical linguistic
structures in connected speech. Nature Neuroscience, 19(1):158–164, January 2016. ISSN 1546-1726. doi:
10.1038/nn.4186.
Lauren M. DiNicola, Rodrigo M. Braga, and Randy L. Buckner. Parallel distributed networks dissociate episodic
and social functions within the individual. Journal of Neurophysiology, 123(3):1144–1179, March 2020. ISSN
1522-1598. doi: 10.1152/jn.00529.2019.
Katharina Dobs, Julio Martinez, Alexander J. E. Kell, and Nancy Kanwisher. Brain-like functional specialization
emerges spontaneously in deep neural networks. Science Advances, 8(11):eabl8913, March 2022. ISSN 2375-2548.
doi: 10.1126/sciadv.abl8913.
John Duncan. The multiple-demand (MD) system of the primate brain: mental programs for intelligent behaviour.
Trends in Cognitive Sciences, 14(4):172–179, April 2010. ISSN 1879-307X. doi: 10.1016/j.tics.2010.01.004.
John Duncan. The Structure of Cognition: Attentional Episodes in Mind and Brain. Neuron, 80(1):35–50, October 2013.
ISSN 0896-6273. doi: 10.1016/j.neuron.2013.09.015. URL https://www.ncbi.nlm.nih.gov/pmc/articles/
PMC3791406/.
John Duncan, Moataz Assem, and Sneha Shashidhara. Integrated Intelligence from Distributed Brain Activity. Trends
in Cognitive Sciences, 24(10):838–852, October 2020. ISSN 1364-6613, 1879-307X. doi: 10.1016/j.tics.2020.06.012.
URL https://www.cell.com/trends/cognitive-sciences/abstract/S1364-6613(20)30169-8. Publisher:
Elsevier.
Nadir Durrani, Hassan Sajjad, Fahim Dalvi, and Yonatan Belinkov. Analyzing Individual Neurons in Pre-trained
Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 4865–4880, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/
2020.emnlp-main.395. URL https://aclanthology.org/2020.emnlp-main.395.
Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav
Goldberg. Measuring and improving consistency in pretrained language models. Transactions of the Association for
Computational Linguistics, 9:1012–1031, 2021a. doi: 10.1162/tacl_a_00410. URL https://aclanthology.org/
2021.tacl-1.60.
Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav
Goldberg. Measuring and improving consistency in pretrained language models. Transactions of the Association for
Computational Linguistics, 9:1012–1031, 2021b.
Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with
amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175, 2021c. doi:
10.1162/tacl_a_00359. URL https://aclanthology.org/2021.tacl-1.10.
J.L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

J.L. Elman. Learning and development in neural networks: the importance of starting small. Cognition, 48(1):71–99,
1993.
Russell A. Epstein, Eva Zita Patai, Joshua B. Julian, and Hugo J. Spiers. The cognitive map in humans: spatial navigation
and beyond. Nature Neuroscience, 20(11):1504–1513, October 2017. ISSN 1546-1726. doi: 10.1038/nn.4656.
Katrin Erk. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass,
6(10):635–653, 2012.
Allyson Ettinger. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models.
Transactions of the Association for Computational Linguistics, 8:34–48, 2020. doi: 10.1162/tacl_a_00298. URL
https://doi.org/10.1162/tacl_a_00298.
Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. Probing for semantic evidence of composition by means of simple
classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages
134–139, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2524.
URL https://aclanthology.org/W16-2524.
Martin BH Everaert, Marinus AC Huybregts, Noam Chomsky, Robert C Berwick, and Johan J Bolhuis. Structures, not
strings: linguistics as part of the cognitive sciences. Trends in cognitive sciences, 19(12):729–743, 2015.
Evelina Fedorenko and Rosemary A. Varley. Language and thought are not the same thing: evidence from neuroimaging
and neurological patients. Annals of the New York Academy of Sciences, 1369(1):132–153, April 2016. ISSN
0077-8923. doi: 10.1111/nyas.13046. URL http://doi.wiley.com/10.1111/nyas.13046.
Evelina Fedorenko, Po-Jang Hsieh, Alfonso Nieto-Castañón, Susan Whitfield-Gabrieli, and Nancy Kanwisher. New
method for fMRI investigations of language: defining ROIs functionally in individual subjects. Journal of Neuro-
physiology, 104(2):1177–1194, August 2010. ISSN 1522-1598. doi: 10.1152/jn.00032.2010.
Evelina Fedorenko, Michael K. Behr, and Nancy Kanwisher. Functional specificity for high-level linguistic processing
in the human brain. Proceedings of the National Academy of Sciences, 108(39):16428–16433, September 2011. ISSN
0027-8424, 1091-6490. doi: 10.1073/pnas.1112937108. URL https://www.pnas.org/content/108/39/16428.
Evelina Fedorenko, Alfonso Nieto-Castañón, and Nancy Kanwisher. Syntactic processing in the human brain: what we
know, what we don’t know, and a suggestion for how to proceed. Brain and Language, 120(2):187–207, February
2012. ISSN 1090-2155. doi: 10.1016/j.bandl.2011.01.001.
Evelina Fedorenko, John Duncan, and Nancy Kanwisher. Broad domain generality in focal regions of frontal and parietal
cortex. Proceedings of the National Academy of Sciences of the United States of America, 110(41):16616–16621,
October 2013. ISSN 1091-6490. doi: 10.1073/pnas.1315235110.
Evelina Fedorenko, Terri L. Scott, Peter Brunner, William G. Coon, Brianna Pritchett, Gerwin Schalk, and Nancy
Kanwisher. Neural correlate of the construction of sentence meaning. Proceedings of the National Academy of
Sciences, 113(41):E6256–E6262, October 2016. doi: 10.1073/pnas.1612132113. URL https://www.pnas.org/
doi/10.1073/pnas.1612132113. Publisher: Proceedings of the National Academy of Sciences.
Evelina Fedorenko, Idan Blank, Matthew Siegelman, and Zachary Mineroff. Lack of selectivity for syntax relative to
word meanings throughout the language network. bioRxiv, page 477851, February 2020. doi: 10.1101/477851. URL
https://www.biorxiv.org/content/10.1101/477851v2. Publisher: Cold Spring Harbor Laboratory Section:
New Results.
Wangshu Feng, Yue Wu, Catherine Jan, Hongbo Yu, Xiaoming Jiang, and Xiaolin Zhou. Effects of contextual relevance
on pragmatic inference during conversation: An fMRI study. Brain and Language, 171:52–61, August 2017. ISSN
0093-934X. doi: 10.1016/j.bandl.2017.04.005. URL https://www.sciencedirect.com/science/article/pii/
S0093934X16302887.
Wangshu Feng, Hongbo Yu, and Xiaolin Zhou. Understanding particularized and generalized conversational implica-
tures: Is theory-of-mind necessary? Brain and Language, 212:104878, January 2021. ISSN 0093-934X. doi: 10.1016/
j.bandl.2020.104878. URL https://www.sciencedirect.com/science/article/pii/S0093934X20301371.
Fernanda Ferreira, Karl GD Bailey, and Vittoria Ferraro. Good-enough representations in language comprehension.
Current directions in psychological science, 11(1):11–15, 2002.

Evelyn C. Ferstl and D. Yves von Cramon. What Does the Frontomedian Cortex Contribute to Language Processing:
Coherence or Theory of Mind? NeuroImage, 17(3):1599–1612, November 2002. ISSN 1053-8119. doi: 10.1006/
nimg.2002.1247. URL https://www.sciencedirect.com/science/article/pii/S1053811902912474.
Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov. Causal
analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting
of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pages 1828–1843, Online, August 2021. Association for Computational
Linguistics. doi: 10.18653/v1/2021.acl-long.144. URL https://aclanthology.org/2021.acl-long.144.
Jason Fischer, John G. Mikhael, Joshua B. Tenenbaum, and Nancy Kanwisher. Functional neuroanatomy of intuitive
physical inference. Proceedings of the National Academy of Sciences, 113(34):E5072–E5081, August 2016. ISSN
0027-8424, 1091-6490. doi: 10.1073/pnas.1610344113. URL https://www.pnas.org/content/113/34/E5072.
P. C. Fletcher, F. Happé, U. Frith, S. C. Baker, R. J. Dolan, R. S. J. Frackowiak, and C. D. Frith. Other minds in the
brain: a functional imaging study of “theory of mind” in story comprehension. Cognition, 57(2):109–128, November
1995. ISSN 0010-0277. doi: 10.1016/0010-0277(95)00692-R. URL https://www.sciencedirect.com/science/
article/pii/001002779500692R.
M.C. Frank and J.B. Tenenbaum. Three ideal observer models for rule learning in simple languages. Cognition, 120(3):
360–371, 2011.
Michael C Frank and Noah D Goodman. Predicting pragmatic reasoning in language games. Science, 336(6084):
998–998, 2012. Publisher: American Association for the Advancement of Science.
L. Frazier. On comprehending sentences: Syntactic parsing strategies. ETD Collection for University of Connecticut,
1979.
Robert M. French. Subcognition and the limits of the Turing test. Mind, 99(393):53–65, 1990. Publisher: JSTOR.
Robert M. French. The Turing Test: the first 50 years. Trends in cognitive sciences, 4(3):115–122, 2000. Publisher:
Elsevier.
Richard Futrell, Ethan Wilcox, Takashi Morita, and Roger Levy. RNNs as psycholinguistic subjects: Syntactic state
and grammatical dependency. arXiv preprint arXiv:1809.01329, 2018.
H. L. Gallagher, F. Happé, N. Brunswick, P. C. Fletcher, U. Frith, and C. D. Frith. Reading the mind in cartoons and
stories: an fMRI study of ’theory of mind’ in verbal and nonverbal tasks. Neuropsychologia, 38(1):11–21, 2000.
ISSN 0028-3932. doi: 10.1016/s0028-3932(99)00053-6.
Jon Gauthier, Jennifer Hu, Ethan Wilcox, Peng Qian, and Roger Levy. SyntaxGym: An Online Platform for Targeted
Evaluation of Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics: System Demonstrations, pages 70–76, Online, July 2020. Association for Computational Linguistics.
doi: 10.18653/v1/2020.acl-demos.10. URL https://aclanthology.org/2020.acl-demos.10.
L.A. Gerken. Decisions, decisions: Infant language learning when multiple generalizations are possible. Cognition, 98
(3):B67–B74, 2006.
Edward Gibson and Neal J. Pearlmutter. Constraints on sentence comprehension. Trends in cognitive sciences, 2(7):
262–268, 1998. Publisher: Elsevier.
Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. Under the hood: Using
diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings
of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248,
Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5426. URL
https://aclanthology.org/W18-5426.
Lila R. Gleitman. A human universal: the capacity to learn a language. Modern philology, 90:S13–S33, 1993. Publisher:
University of Chicago Press.
J. Gläscher, D. Rudrauf, R. Colom, L. K. Paul, D. Tranel, H. Damasio, and R. Adolphs. Distributed neural system
for general intelligence revealed by lesion mapping. Proceedings of the National Academy of Sciences, 107
(10):4705–4709, March 2010. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.0910397107. URL https:
//www.pnas.org/content/107/10/4705. Publisher: National Academy of Sciences Section: Biological Sciences.

Adele E Goldberg. Constructions work. Walter de Gruyter GmbH & Co. KG, 2009.
Adele E Goldberg. Explain me this: Creativity, competition, and the partial productivity of constructions. Princeton
University Press, 2019.
Ariel Goldstein, Zaid Zada, Eliav Buchnik, Mariano Schain, Amy Price, Bobbi Aubrey, Samuel A. Nastase, Amir
Feder, Dotan Emanuel, Alon Cohen, Aren Jansen, Harshvardhan Gazula, Gina Choe, Aditi Rao, Catherine Kim,
Colton Casto, Lora Fanda, Werner Doyle, Daniel Friedman, Patricia Dugan, Lucia Melloni, Roi Reichart, Sasha
Devore, Adeen Flinker, Liat Hasenfratz, Omer Levy, Avinatan Hassidim, Michael Brenner, Yossi Matias, Kenneth A.
Norman, Orrin Devinsky, and Uri Hasson. Shared computational principles for language processing in humans
and deep language models. Nature Neuroscience, 25(3):369–380, March 2022. ISSN 1546-1726. doi: 10.1038/
s41593-022-01026-4. URL https://www.nature.com/articles/s41593-022-01026-4. Number: 3 Publisher:
Nature Publishing Group.
Jonathan Gordon and Benjamin Van Durme. Reporting bias and knowledge acquisition. In Proceedings of the 2013
workshop on Automated knowledge base construction, pages 25–30, 2013.
Anirudh Goyal, Aniket Didolkar, Alex Lamb, Kartikeya Badola, Nan Rosemary Ke, Nasim Rahaman, Jonathan Binas,
Charles Blundell, Michael Mozer, and Yoshua Bengio. Coordination Among Neural Modules Through a Shared
Global Workspace, March 2022. URL http://arxiv.org/abs/2103.01197. arXiv:2103.01197 [cs, stat].
Gabriel Grand, Idan Asher Blank, Francisco Pereira, and Evelina Fedorenko. Semantic projection recovers rich human
knowledge of multiple object features from word embeddings. Nature Human Behaviour, pages 1–13, 2022.
H. P. Grice. Utterer’s meaning and intention. The philosophical review, 78(2):147–177, 1969.
H.P. Grice. Logic and conversation. In Peter Cole and Jerry L. Morgan, editors, Syntax and Semantics, Vol. 3, Speech
Acts, pages 41–58. Academic Press, New York, 1975.
Myrto Grigoroglou and Patricia A. Ganea. Language as a mechanism for reasoning about possibilities. Philosophical
Transactions of the Royal Society B: Biological Sciences, 377(1866):20210334, December 2022. doi: 10.1098/
rstb.2021.0334. URL https://royalsocietypublishing.org/doi/abs/10.1098/rstb.2021.0334. Publisher:
Royal Society.
Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. Colorless green recurrent
networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–
1205, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1108.
URL https://aclanthology.org/N18-1108.
Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, and Yinfei Yang. LongT5:
Efficient text-to-text transformer for long sequences. arXiv preprint arXiv:2112.07916, 2021.
Olga Gurevich, Matthew A. Johnson, and Adele E. Goldberg. Incidental verbatim memory for language. Language
and Cognition, 2(1):45–78, May 2010. ISSN 1866-9859. doi: 10.1515/langcog.2010.003. URL https://
www.degruyter.com/document/doi/10.1515/langcog.2010.003/html. Publisher: De Gruyter Mouton Section:
Language and Cognition.
Peter Hagoort and Stephen C. Levinson. Neuropragmatics. In The cognitive neurosciences, 5th ed, pages 667–674.
MIT Press, Cambridge, MA, US, 2014. ISBN 978-0-262-02777-9.
Demis Hassabis and Eleanor A. Maguire. The construction system of the brain. Philosophical Transactions of the Royal
Society B: Biological Sciences, 364(1521):1263–1271, May 2009. ISSN 0962-8436. doi: 10.1098/rstb.2008.0296.
URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2666702/.
Miriam Hauptman, Idan Blank, and Evelina Fedorenko. Non-literal language processing is jointly supported by the
language and Theory of Mind networks: Evidence from a novel meta-analytic fMRI approach, March 2022. URL
https://www.biorxiv.org/content/10.1101/2022.03.08.481056v1. Pages: 2022.03.08.481056 Section: New
Results.
Micha Heilbron, Kristijan Armeni, Jan-Mathijs Schoffelen, Peter Hagoort, and Floris P. de Lange. A hierarchy of
linguistic predictions during natural language comprehension. Proceedings of the National Academy of Sciences,
119(32):e2201968119, August 2022. doi: 10.1073/pnas.2201968119. URL https://www.pnas.org/doi/abs/10.
1073/pnas.2201968119. Publisher: Proceedings of the National Academy of Sciences.
Chadi Helwe, Chloé Clavel, and Fabian M Suchanek. Reasoning with transformer-based models: Deep learning, but
shallow reasoning. In 3rd Conference on Automated Knowledge Base Construction, 2021.
John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China, November 2019. Association
for Computational Linguistics. doi: 10.18653/v1/D19-1275. URL https://aclanthology.org/D19-1275.
John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota, June
2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1419. URL https://aclanthology.org/
N19-1419.
Wolfram Hinzen. Narrow syntax and the language of thought. Philosophical Psychology, 26(1):1–23, February
2013. ISSN 0951-5089. doi: 10.1080/09515089.2011.627537. URL https://doi.org/10.1080/09515089.2011.
627537. Publisher: Routledge.
Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger Levy. A Systematic Assessment of Syntactic
Generalization in Neural Language Models. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 1725–1744, Online, July 2020. Association for Computational Linguistics. URL
https://www.aclweb.org/anthology/2020.acl-main.158.
Jennifer Hu, Sammy Floyd, Olessia Jouravlev, Evelina Fedorenko, and Edward Gibson. A fine-grained comparison of
pragmatic language understanding in humans and language models. arXiv preprint arXiv:2212.06801, 2022a.
Jennifer Hu, Hannah Small, Hope Kean, Atsushi Takahashi, Leo Zekelman, Daniel Kleinman, Elizabeth Ryan, Alfonso
Nieto-Castañón, Victor Ferreira, and Evelina Fedorenko. Precision fMRI reveals that the language-selective network
supports both phrase-structure building and lexical access during language production. Cerebral Cortex, page
bhac350, September 2022b. ISSN 1047-3211. doi: 10.1093/cercor/bhac350. URL https://doi.org/10.1093/
cercor/bhac350.
James Y. Huang, Kuan-Hao Huang, and Kai-Wei Chang. Disentangling Semantics and Syntax in Sentence Embeddings
with Pre-trained Language Models. In NAACL. arXiv, April 2021. doi: 10.48550/arXiv.2104.05115. URL
http://arxiv.org/abs/2104.05115. arXiv:2104.05115 [cs].
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos QA: Machine reading comprehension
with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-
IJCNLP), pages 2391–2401, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:
10.18653/v1/D19-1243. URL https://aclanthology.org/D19-1243.
Anne H Charity Hudley and Christine Mallinson. Understanding English language variation in US schools. Teachers
College Press, 2015.
Anne H Charity Hudley, Christine Mallinson, and Mary Bucholtz. Toward racial justice in linguistics: Interdisciplinary
insights into theorizing race in the discipline and diversifying the profession. Language, 96(4):e200–e235, 2020.
Publisher: Linguistic Society of America.
Drew Hudson and Christopher D Manning. Learning by abstraction: The neural state machine. In Advances in Neural
Information Processing Systems, pages 5901–5914, 2019.
Philip A Huebner and Jon A Willits. Structured semantic knowledge can emerge automatically from predicting word
sequences in child-directed speech. Frontiers in Psychology, 9:133, 2018.
Philip A. Huebner, Elior Sulem, Cynthia Fisher, and Dan Roth. BabyBERTa: Learning More Grammar With Small-Scale
Child-Directed Language. In Proceedings of the 25th Conference on Computational Natural Language Learning,
pages 624–646, 2021.
Colin Humphries, Jeffrey R. Binder, David A. Medler, and Einat Liebenthal. Syntactic and Semantic Modulation
of Neural Activity during Auditory Sentence Comprehension. Journal of cognitive neuroscience, 18(4):665–679,
April 2006. ISSN 0898-929X. doi: 10.1162/jocn.2006.18.4.665. URL https://www.ncbi.nlm.nih.gov/pmc/
articles/PMC1635792/.
Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: how do neural networks
generalise? Journal of Artificial Intelligence Research, 67:757–795, 2020.
Nicholas Ichien, Qing Liu, Shuhao Fu, Keith J Holyoak, Alan Yuille, and Hongjing Lu. Visual analogy: Deep learning
versus compositional models. arXiv preprint arXiv:2105.07065, 2021.
Leyla Isik, Kami Koldewyn, David Beeler, and Nancy Kanwisher. Perceiving social interactions in the posterior superior
temporal sulcus. Proceedings of the National Academy of Sciences, 114(43):E9145–E9152, October 2017. ISSN
0027-8424, 1091-6490. doi: 10.1073/pnas.1714471114. URL https://www.pnas.org/content/114/43/E9145.

Anna A. Ivanova, Shashank Srikant, Yotaro Sueoka, Hope H. Kean, Riva Dhamala, Una-May O’Reilly, Marina U. Bers,
and Evelina Fedorenko. Comprehension of computer code relies primarily on domain-general executive resources.
bioRxiv, page 2020.04.16.045732, April 2020. doi: 10.1101/2020.04.16.045732. URL https://www.biorxiv.org/
content/10.1101/2020.04.16.045732v1. Publisher: Cold Spring Harbor Laboratory Section: New Results.
Anna A. Ivanova, John Hewitt, and Noga Zaslavsky. Probing artificial neural networks: insights from neuroscience. In
ICLR 2021 Workshop: How Can Findings About The Brain Improve AI Systems?, 2021a. arXiv:2104.08197.
Anna A. Ivanova, Zachary Mineroff, Vitor Zimmerer, Nancy Kanwisher, Rosemary Varley, and Evelina Fedorenko. The
Language Network is Recruited but Not Required for Nonverbal Event Semantics. Neurobiology of Language, pages
1–26, January 2021b. doi: 10.1162/nol_a_00030. URL https://doi.org/10.1162/nol_a_00030. Publisher: MIT
Press.
R. Jackendoff and S. Pinker. The nature of the language faculty and its implications for evolution of language (reply to
fitch, hauser, and chomsky). Cognition, 97(2):211–225, 2005.
Ray Jackendoff. Foundations of language: Brain, meaning, grammar, evolution. Oxford University Press, USA, 2002.
Nir Jacoby and Evelina Fedorenko. Discourse-level comprehension engages medial frontal Theory of Mind brain regions
even for expository texts. Language, Cognition and Neuroscience, 35(6):780–796, July 2020. ISSN 2327-3798.
doi: 10.1080/23273798.2018.1525494. URL https://doi.org/10.1080/23273798.2018.1525494. Publisher:
Routledge.
Nir Jacoby, Emile Bruneau, Jorie Koster-Hale, and Rebecca Saxe. Localizing Pain Matrix and Theory of Mind
networks with both verbal and non-verbal stimuli. NeuroImage, 126:39–48, February 2016. ISSN 1053-8119. doi:
10.1016/j.neuroimage.2015.11.025. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4733571/.
Gijeong Jang, Shin-ae Yoon, Sung-Eun Lee, Haeil Park, Joohan Kim, Jeong Hoon Ko, and Hae-Jeong Park. Everyday
conversation requires cognitive inference: Neural bases of comprehending implicated meanings in conversations.
NeuroImage, 81:61–72, November 2013. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2013.05.027. URL https:
//www.sciencedirect.com/science/article/pii/S1053811913005211.
Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and
Ross Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International
Conference on Computer Vision, pages 2989–2998, 2017.
Olessia Jouravlev, David Zheng, Zuzanna Balewski, Alvince Le Arnz Pongos, Zena Levan, Susan Goldin-Meadow,
and Evelina Fedorenko. Speech-accompanying gestures are not processed by the language-processing mechanisms.
Neuropsychologia, 132:107132, September 2019. ISSN 0028-3932. doi: 10.1016/j.neuropsychologia.2019.107132.
URL http://www.sciencedirect.com/science/article/pii/S0028393218304469.
Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, Upper Saddle River, NJ,
second edition, 2009. ISBN 9780131873216.
Masahiro Kaneko, Danushka Bollegala, and Naoaki Okazaki. Debiasing isn’t enough! – on the effectiveness of
debiasing MLMs and their social biases in downstream tasks. In Proceedings of the 29th International Conference on
Computational Linguistics, pages 1299–1310, Gyeongju, Republic of Korea, October 2022. International Committee
on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.111.
N. Kanwisher, J. McDermott, and M. M. Chun. The fusiform face area: a module in human extrastriate cortex
specialized for face perception. The Journal of Neuroscience, 17(11):4302, 1997. ISSN 0270-6474.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec
Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,
2020.
Nora Kassner and Hinrich Schütze. Negated and misprimed probes for pretrained language models: Birds can talk,
but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages
7811–7818, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.698.
URL https://aclanthology.org/2020.acl-main.698.
Caitlin Keenan. A pleasant three days in Philadelphia: Arguments for a pseudopartitive analysis. University of
Pennsylvania Working Papers in Linguistics, 19(1):11, 2013.

Huda Khayrallah and Philipp Koehn. On the impact of various types of noise on neural machine translation. In
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83, Melbourne,
Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-2709. URL https:
//aclanthology.org/W18-2709.
Najoung Kim and Paul Smolensky. Testing for grammatical category abstraction in neural language models. In
Proceedings of the Society for Computation in Linguistics 2021, pages 467–470, Online, February 2021. Association
for Computational Linguistics. URL https://aclanthology.org/2021.scil-1.59.
Najoung Kim, Tal Linzen, and Paul Smolensky. Uncontrolled lexical exposure leads to overestimation of compositional
generalization in pretrained models. arXiv preprint arXiv:2212.10769, 2022.
Katherine D Kinzler. Language as a social cue. Annual review of psychology, 72:241–264, 2021. Publisher: Annual
Reviews.
Katherine D Kinzler, Kristin Shutts, Jasmine DeJesus, and Elizabeth S Spelke. Accent trumps race in guiding children’s
social preferences. Social cognition, 27(4):623–634, 2009. Publisher: Guilford Press.
Nicolai Klessinger, Marcin Szczerbinski, and Rosemary A. Varley. Algebra in a man with severe aphasia. Neuropsy-
chologia, 45(8):1642–1648, January 2007. ISSN 0028-3932. doi: 10.1016/j.neuropsychologia.2007.01.005. URL
https://www.sciencedirect.com/science/article/pii/S0028393207000280.
Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, and Orhan
Firat. Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference, September 2021. URL http:
//arxiv.org/abs/2110.03742. arXiv:2110.03742 [cs].
Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. LSTMs can learn
syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Melbourne,
Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1132. URL https:
//aclanthology.org/P18-1132.
G. R. Kuperberg, P. K. McGuire, E. T. Bullmore, M. J. Brammer, S. Rabe-Hesketh, I. C. Wright, D. J. Lythgoe,
S. C. R. Williams, and A. S. David. Common and Distinct Neural Substrates for Pragmatic, Semantic, and Syntactic
Processing of Spoken Sentences: An fMRI Study. Journal of Cognitive Neuroscience, 12(2):321–341, March 2000.
ISSN 0898-929X. doi: 10.1162/089892900562138. URL https://doi.org/10.1162/089892900562138.
Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. Measuring bias in contextualized word
representations. arXiv preprint arXiv:1906.07337, 2019.
William Labov. Sociolinguistics. A Survey of Linguistic Science, 1978.
Brenden M Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-
sequence recurrent networks. In International Conference on Machine Learning, pages 2879–2888, 2018.
Robin Lakoff. Language in context. Language, pages 907–927, 1972.
Yair Lakretz, German Kruszewski, Theo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, and Marco Baroni. The
emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pages 11–20, Minneapolis, Minnesota, June 2019. Association for Computational
Linguistics. doi: 10.18653/v1/N19-1002. URL https://aclanthology.org/N19-1002.
Yair Lakretz, Théo Desbordes, Dieuwke Hupkes, and Stanislas Dehaene. Causal Transformers Perform Below Chance
on Recursive Nested Constructions, Unlike Humans, October 2021. URL http://arxiv.org/abs/2110.07240.
arXiv:2110.07240 [cs].
Yair Lakretz, Théo Desbordes, Dieuwke Hupkes, and Stanislas Dehaene. Can transformers process recursive nested
constructions, like humans? In Proceedings of the 29th International Conference on Computational Linguistics, pages
3226–3232, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics.
URL https://aclanthology.org/2022.coling-1.285.
Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry Tessler, Antonia
Creswell, James L McClelland, Jane X Wang, and Felix Hill. Can language models learn from explanations in
context? arXiv preprint arXiv:2204.02329, 2022.
Ronald W Langacker. A usage-based model. In Topics in cognitive linguistics, page 127. John Benjamins, 1988.
Ronald W Langacker. How not to disagree: The emergence of structure from usage. Language usage and language
structure, 213:107, 2010.

Shalom Lappin. Deep learning and linguistic representation. CRC Press, 2021.
Karim Lasri, Tiago Pimentel, Alessandro Lenci, Thierry Poibeau, and Ryan Cotterell. Probing for the usage of
grammatical number. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pages 8818–8831, Dublin, Ireland, May 2022a. Association for Computational Linguistics.
doi: 10.18653/v1/2022.acl-long.603. URL https://aclanthology.org/2022.acl-long.603.
Karim Lasri, Olga Seminck, Alessandro Lenci, and Thierry Poibeau. Subject verb agreement error patterns in
meaningless sentences: Humans vs. BERT. arXiv preprint arXiv:2209.10538, 2022b.
Ryan Law and Liina Pylkkänen. Lists with and without syntax: A new approach to measuring the neural processing of
syntax. Journal of Neuroscience, January 2021. ISSN 0270-6474, 1529-2401. doi: 10.1523/JNEUROSCI.1179-20.
2021. URL https://www.jneurosci.org/content/early/2021/02/10/JNEUROSCI.1179-20.2021. Publisher:
Society for Neuroscience Section: Research Articles.
A. R. Lecours and Y. Joanette. Linguistic and other psychological aspects of paroxysmal aphasia. Brain and Language,
10(1):1–23, May 1980. ISSN 0093-934X. doi: 10.1016/0093-934x(80)90034-6.
Lillian Lee. "I'm sorry Dave, I'm afraid I can't do that": Linguistics, statistics, and natural language processing circa
2001. arXiv preprint cs/0304027, 2003.
Haemy Lee Masson and Leyla Isik. Functional selectivity for social interaction perception in the human su-
perior temporal sulcus during natural viewing. NeuroImage, 245:118741, December 2021. ISSN 1053-
8119. doi: 10.1016/j.neuroimage.2021.118741. URL https://www.sciencedirect.com/science/article/
pii/S1053811921010132.
Alessandro Lenci. Distributional semantics in linguistic and cognitive research. Italian journal of linguistics, 20(1):
1–31, 2008.
Yulia Lerner, Christopher J. Honey, Lauren J. Silbert, and Uri Hasson. Topographic Mapping of a Hierarchy of Temporal
Receptive Windows Using a Narrated Story. The Journal of Neuroscience, 31(8):2906–2915, February 2011. ISSN
0270-6474. doi: 10.1523/JNEUROSCI.3684-10.2011. URL https://www.ncbi.nlm.nih.gov/pmc/articles/
PMC3089381/.
Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In Thirteenth International
Conference on the Principles of Knowledge Representation and Reasoning. Citeseer, 2012.
S.C. Levinson. Presumptive meanings: The theory of generalized conversational implicature. MIT Press, Cambridge,
MA, 2000. ISBN 0262122189.
Stephen C. Levinson. Pragmatics. Cambridge University Press, 1983.
Molly Lewis and Gary Lupyan. Gender stereotypes are reflected in the distributional structure of 25 languages. Nature
human behaviour, 4(10):1021–1028, 2020.
Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh
Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning
problems with language models. In Advances in Neural Information Processing Systems, 2022.
Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially
private learners. arXiv preprint arXiv:2110.05679, 2021.
Tal Linzen. What can linguistics and deep learning contribute to each other? Response to Pater. Language, 95(1):
e99–e108, 2019. Publisher: Linguistic Society of America.
Tal Linzen and Marco Baroni. Syntactic Structure from Deep Learning. Annual Reviews of Linguistics, 2021.
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of LSTMs to learn syntax-sensitive dependencies.
Transactions of the Association for Computational Linguistics, 4:521–535, 2016.
Qi Liu, Dani Yogatama, and Phil Blunsom. Relational Memory-Augmented Language Models. Transactions of the
Association for Computational Linguistics, 10:555–572, May 2022a. ISSN 2307-387X. doi: 10.1162/tacl_a_00476.
URL https://doi.org/10.1162/tacl_a_00476.
Xiao Liu, Da Yin, Yansong Feng, and Dongyan Zhao. Things not written in text: Exploring spatial commonsense from
visual signals. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 2365–2376, Dublin, Ireland, May 2022b. Association for Computational Linguistics. doi:
10.18653/v1/2022.acl-long.168. URL https://aclanthology.org/2022.acl-long.168.

Y. Liu, J. Kim, C. Wilson, and M. Bedny. Computer code comprehension shares neural resources with formal logical
inference in the fronto-parietal network. bioRxiv, page 2020.05.24.096180, June 2020. doi: 10.1101/2020.05.24.
096180. URL https://www.biorxiv.org/content/10.1101/2020.05.24.096180v3. Publisher: Cold Spring
Harbor Laboratory Section: New Results.
Joao Loula, Marco Baroni, and Brenden M Lake. Rearranging the familiar: Testing compositional generalization in
recurrent networks. arXiv preprint arXiv:1807.07545, 2018.
Charles Lovering and Ellie Pavlick. Unit testing for concepts in neural networks. Transactions of the Association for
Computational Linguistics, 10:1193–1208, 2022.
Kaiji Lu, Piotr Mardziel, Klas Leino, Matt Fredrikson, and Anupam Datta. Influence paths for characterizing subject-
verb number agreement in LSTM language models. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, pages 4748–4757, Online, July 2020. Association for Computational Linguistics. doi:
10.18653/v1/2020.acl-main.430. URL https://aclanthology.org/2020.acl-main.430.
Li Lucy and Jon Gauthier. Are distributional representations ready for the real world? evaluating word vectors for
grounded perceptual meaning. In Proceedings of the First Workshop on Language Grounding for Robotics, pages
76–85, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-2810.
URL https://aclanthology.org/W17-2810.
A. R. Luria, L. S. Tsvetkova, and D. S. Futer. Aphasia in a composer (V. G. Shebalin). Journal of the Neurological
Sciences, 2(3):288–292, June 1965. ISSN 0022-510X. doi: 10.1016/0022-510x(65)90113-9.
Maryellen C. MacDonald. The interaction of lexical and syntactic ambiguity. Journal of memory and language, 32(5):
692–715, 1993. Publisher: Elsevier.
Maryellen C. MacDonald. How language production shapes language form and comprehension. Frontiers in psychology,
4:226, 2013. Publisher: Frontiers Media SA.
Maryellen C. MacDonald, Marcel Adam Just, and Patricia A. Carpenter. Working memory constraints on the processing
of syntactic ambiguity. Cognitive psychology, 24(1):56–98, 1992. Publisher: Elsevier.
Maryellen C. MacDonald, Neal J. Pearlmutter, and Mark S. Seidenberg. The lexical nature of syntactic ambiguity
resolution. Psychological review, 101(4):676, 1994. Publisher: American Psychological Association.
Mairéad MacSweeney, Bencie Woll, Ruth Campbell, Philip K. McGuire, Anthony S. David, Steven C. R. Williams,
John Suckling, Gemma A. Calvert, and Michael J. Brammer. Neural systems underlying British Sign Language
and audio-visual English processing in native users. Brain, 125(7):1583–1593, July 2002. ISSN 0006-8950. doi:
10.1093/brain/awf153. URL https://doi.org/10.1093/brain/awf153.
Brian MacWhinney. The competition model. Mechanisms of language acquisition, pages 249–308,
1987.
Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. Emergent linguistic
structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences,
117(48):30046–30054, December 2020. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1907367117. URL
http://www.pnas.org/lookup/doi/10.1073/pnas.1907367117.
Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner:
Interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584, 2019.
Gary Marcus. The next decade in AI: Four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177,
2020.
Gary Marcus, Francesca Rossi, and Manuela Veloso. Beyond the Turing test. AI Magazine, 37(1):3–4, 2016.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé
Seddah, and Benoît Sagot. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, pages 7203–7219, Online, July 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.acl-main.645. URL https://aclanthology.org/2020.acl-main.645.
Rebecca Marvin and Tal Linzen. Targeted syntactic evaluation of language models. In Proceedings of the
2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Bel-
gium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1151. URL
https://aclanthology.org/D18-1151.

William Matchin and Emily Wood. Syntax-Sensitive Regions of the Posterior Inferior Frontal Gyrus and the Posterior
Temporal Lobe Are Differentially Recruited by Production and Perception. Cerebral Cortex Communications, 1
(1):tgaa029, January 2020. ISSN 2632-7376. doi: 10.1093/texcom/tgaa029. URL https://doi.org/10.1093/
texcom/tgaa029.
William Matchin, Christopher Mathias Hammerly, and Ellen F. Lau. The role of the IFG and pSTS in syntactic
prediction: Evidence from a parametric study of hierarchical structure in fMRI. Cortex, 88:106–123, 2017. doi:
10.1016/j.cortex.2016.12.010.
William Matchin, Christian Brodbeck, Christopher Hammerly, and Ellen Lau. The temporal dynamics of structure and
content in sentence comprehension: Evidence from fMRI-constrained MEG. Human Brain Mapping, 40(2):663–678,
February 2019. ISSN 1097-0193. doi: 10.1002/hbm.24403.
Rowan Hall Maudslay and Ryan Cotterell. Do syntactic probes probe syntax? experiments with jabberwocky probing. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pages 124–131, Online, June 2021. Association for Computational Linguistics. doi:
10.18653/v1/2021.naacl-main.11. URL https://aclanthology.org/2021.naacl-main.11.
R Thomas McCoy, Robert Frank, and Tal Linzen. Revisiting the poverty of the stimulus: hierarchical generalization
without a hierarchical bias in recurrent neural networks. arXiv preprint arXiv:1802.09091, 2018.
R Thomas McCoy, Robert Frank, and Tal Linzen. Does syntax need to grow on trees? sources of hierarchical inductive
bias in sequence-to-sequence networks. arXiv preprint arXiv:2001.03632, 2020.
R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. How much do language models
copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. arXiv preprint
arXiv:2111.09509, 2021.
Tom McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural
language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pages 3428–3448, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1334.
URL https://aclanthology.org/P19-1334.
Markus Meister. Learning, fast and slow. Current Opinion in Neurobiology, 75:102555, 2022.
Laura Menenti, Sarah M. E. Gierhan, Katrien Segaert, and Peter Hagoort. Shared language: overlap and segregation of
the neuronal infrastructure for speaking and listening revealed by functional MRI. Psychological Science, 22(9):
1173–1182, September 2011. ISSN 1467-9280. doi: 10.1177/0956797611418347.
William Merrill, Alex Warstadt, and Tal Linzen. Entailment semantics can be extracted from an ideal language model,
2022. URL https://arxiv.org/abs/2209.12407.
M. M. Mesulam. Primary progressive aphasia. Annals of Neurology, 49(4):425–432, April 2001. ISSN 0364-5134.
M.-Marsel Mesulam, Emily J. Rogalski, Christina Wieneke, Robert S. Hurley, Changiz Geula, Eileen H. Bigio,
Cynthia K. Thompson, and Sandra Weintraub. Primary progressive aphasia and the evolving neurology of the
language network. Nature Reviews. Neurology, 10(10):554–569, October 2014. ISSN 1759-4766. doi: 10.1038/
nrneurol.2014.159.
David Meunier, Renaud Lambiotte, and Edward T. Bullmore. Modular and hierarchically modular organization of brain
networks. Frontiers in neuroscience, 4:200, 2010. Publisher: Frontiers.
Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. A Diverse Corpus for Evaluating and Developing English Math
Word Problem Solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
pages 975–984, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.92.
URL https://aclanthology.org/2020.acl-main.92.
Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. What kind of language is hard to
language-model? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages
4975–4989, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1491. URL
https://aclanthology.org/P19-1491.
Kanishka Misra, Julia Taylor Rayz, and Allyson Ettinger. Comps: Conceptual minimal pair sentences for testing
property knowledge and inheritance in pre-trained language models. arXiv preprint arXiv:2210.01963, 2022.
Martin M. Monti, Daniel N. Osherson, Michael J. Martinez, and Lawrence M. Parsons. Functional neuroanatomy of
deductive inference: A language-independent distributed network. NeuroImage, 37(3):1005–1016, September 2007.
ISSN 1053-8119. doi: 10.1016/j.neuroimage.2007.04.069. URL http://www.sciencedirect.com/science/
article/pii/S1053811907003436.
Martin M. Monti, Lawrence M. Parsons, and Daniel N. Osherson. The boundaries of language and thought in deductive
inference. Proceedings of the National Academy of Sciences, 106(30):12554–12559, July 2009. ISSN 0027-8424,
1091-6490. doi: 10.1073/pnas.0902422106. URL http://www.pnas.org/cgi/doi/10.1073/pnas.0902422106.
Martin M. Monti, Lawrence M. Parsons, and Daniel N. Osherson. Thought beyond language: neural dissociation
of algebra and natural language. Psychological Science, 23(8):914–922, August 2012. ISSN 1467-9280. doi:
10.1177/0956797612437427.
James H. Moor. An analysis of the Turing test. Philosophical Studies: An International Journal for Philosophy in the
Analytic Tradition, 30(4):249–257, 1976. Publisher: JSTOR.
Aaron Mueller, Yu Xia, and Tal Linzen. Causal analysis of syntactic agreement neurons in multilingual language
models. arXiv preprint arXiv:2210.14328, 2022.
Norman L Munn. Handbook of psychological research on the rat; an introduction to animal psychology. Houghton
Mifflin, 1950.
Matthew J. Nelson, Imen El Karoui, Kristof Giber, Xiaofang Yang, Laurent Cohen, Hilda Koopman, Sydney S.
Cash, Lionel Naccache, John T. Hale, Christophe Pallier, and Stanislas Dehaene. Neurophysiological dynamics of
phrase-structure building during sentence processing. Proceedings of the National Academy of Sciences, 114(18):
E3669–E3678, May 2017. doi: 10.1073/pnas.1701590114. URL https://www.pnas.org/doi/10.1073/pnas.
1701590114. Publisher: Proceedings of the National Academy of Sciences.
Peter Norvig. Colorless green ideas learn furiously: Chomsky and the two cultures of statistical learning. Significance,
9(4):30–33, 2012.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David
Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate
computation with language models. arXiv preprint arXiv:2112.00114, 2021.
Timothy J O’Donnell. Productivity and reuse in language: A theory of linguistic computation and storage. MIT Press,
2015.
T.J. O’Donnell. Productivity and reuse in language. PhD thesis, Harvard University, 2011.
OpenAI. Aligning Language Models to Follow Instructions, January 2022a. URL https://openai.com/blog/
instruction-following/.
OpenAI. ChatGPT: Optimizing Language Models for Dialogue, November 2022b. URL https://openai.com/blog/
chatgpt/.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow
instructions with human feedback, March 2022. URL http://arxiv.org/abs/2203.02155. arXiv:2203.02155
[cs].
Christophe Pallier, Anne-Dominique Devauchelle, and Stanislas Dehaene. Cortical representation of the constituent
structure of sentences. Proceedings of the National Academy of Sciences, 108(6):2522–2527, February 2011. doi: 10.
1073/pnas.1018711108. URL https://www.pnas.org/doi/10.1073/pnas.1018711108. Publisher: Proceedings
of the National Academy of Sciences.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP Models really able to Solve Simple Math Word Problems?
In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages 2080–2094, Online, June 2021. Association for Computational
Linguistics. doi: 10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168.
Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International Conference
on Learning Representations, 2021.
Joe Pater. Generative linguistics and neural networks at 60: Foundation, friction, and fusion. Language, 95(1):e41–e74,
2019. Publisher: Linguistic Society of America.
Nikole D Patson and E Matthew Husband. Misinterpretations in agreement and agreement attraction. Quarterly Journal
of Experimental Psychology, 69(5):950–971, 2016.
Karalyn Patterson, Peter J. Nestor, and Timothy T. Rogers. Where do you know what you know? The representation of
semantic knowledge in the human brain. Nature Reviews. Neuroscience, 8(12):976–987, December 2007. ISSN
1471-0048. doi: 10.1038/nrn2277.
Alexander M. Paunov, Idan A. Blank, and Evelina Fedorenko. Functionally distinct language and Theory of Mind
networks are synchronized at rest and during language comprehension. Journal of Neurophysiology, 121(4):1244–
1265, April 2019. ISSN 0022-3077, 1522-1598. doi: 10.1152/jn.00619.2018. URL https://www.physiology.
org/doi/10.1152/jn.00619.2018.
Alexander M. Paunov, Idan A. Blank, Olessia Jouravlev, Zachary Mineroff, Jeanne Gallée, and Evelina Fedorenko.
Differential Tracking of Linguistic vs. Mental State Content in Naturalistic Stimuli by Language and Theory
of Mind (ToM) Brain Networks. Neurobiology of Language, pages 1–29, June 2022. ISSN 2641-4368. doi:
10.1162/nol_a_00071. URL https://doi.org/10.1162/nol_a_00071.
Amy Perfors, Joshua B. Tenenbaum, and Terry Regier. The learnability of abstract syntactic principles. Cognition,
118(3):306–338, March 2011. ISSN 00100277. doi: 10.1016/j.cognition.2010.11.001. URL http://linkinghub.
elsevier.com/retrieve/pii/S0010027710002593.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander
Miller. Language Models as Knowledge Bases? In Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-
IJCNLP), pages 2463–2473, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:
10.18653/v1/D19-1250. URL https://www.aclweb.org/anthology/D19-1250.
Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine
Jernite, Vladimir Karpukhin, Jean Maillard, et al. Kilt: a benchmark for knowledge intensive language tasks. In
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pages 2523–2544, 2021.
Steven T. Piantadosi and Felix Hill. Meaning without reference in large language models, 2022. URL https:
//arxiv.org/abs/2208.02957.
Tiago Pimentel and Ryan Cotterell. A Bayesian framework for information-theoretic probing. In Proceedings of
the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2869–2887, Online and Punta
Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.
emnlp-main.229. URL https://aclanthology.org/2021.emnlp-main.229.
Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. Information-
theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 4609–4622, Online, July 2020. Association for Computational Linguistics. doi:
10.18653/v1/2020.acl-main.420. URL https://aclanthology.org/2020.acl-main.420.
Ayse Pinar Saygin, Ilyas Cicekli, and Varol Akman. Turing test: 50 years later. Minds and machines, 10(4):463–518,
2000. Publisher: Springer.
Philippe Pinel and Stanislas Dehaene. Beyond Hemispheric Dominance: Brain Regions Underlying the Joint Lateraliza-
tion of Language and Arithmetic to the Left Hemisphere. Journal of Cognitive Neuroscience, 22(1):48–66, January
2009. ISSN 0898-929X. doi: 10.1162/jocn.2009.21184. URL https://doi.org/10.1162/jocn.2009.21184.
S. Pinker. Words and rules: The ingredients of language. Harper Perennial, 2000.
S. Pinker and P. Bloom. Natural language and natural selection. Behavioral and brain sciences, 13:707–784, 1990.
S. Pinker and R. Jackendoff. The faculty of language: what’s special about it? Cognition, 95(2):201–236, 2005.
Steven Pinker and Alan Prince. On language and connectionism: Analysis of a parallel distributed processing model of
language acquisition. Cognition, 28(1-2):73–193, 1988. Publisher: Elsevier.
Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual BERT? In Proceedings of the
57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy, July
2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1493. URL https://aclanthology.org/
P19-1493.
Mary C. Potter and Linda Lombardi. Syntactic priming in immediate recall of sentences. Journal of Mem-
ory and Language, 38(3):265–282, 1998. URL http://www.sciencedirect.com/science/article/pii/
S0749596X97925468.
C. Potts. Is it possible for language models to achieve language understanding? Medium, online publication, 2020.
RT Pramod, Michael A Cohen, Joshua B Tenenbaum, and Nancy Kanwisher. Invariant representation of physical
stability in the human brain. eLife, 11:e71736, May 2022. ISSN 2050-084X. doi: 10.7554/eLife.71736. URL
https://doi.org/10.7554/eLife.71736. Publisher: eLife Sciences Publications, Ltd.
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the
compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.
Brianna L. Pritchett, Caitlyn Hoeflin, Kami Koldewyn, Eyal Dechter, and Evelina Fedorenko. High-level language
processing regions are not engaged in action observation or imitation. Journal of Neurophysiology, 120(5):2555–2570,
August 2018. ISSN 0022-3077. doi: 10.1152/jn.00222.2018. URL https://www.physiology.org/doi/full/10.
1152/jn.00222.2018.
G. K. Pullum and B. C. Scholz. Empirical assessment of stimulus poverty arguments. The Linguistic Review, 18(1-2):
9–50, 2002. ISSN 0167-6318.
Jack Rae and Ali Razavi. Do transformers need deep long-range memory? In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, pages 7524–7529, 2020.
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah
Henderson, Roman Ring, Susannah Young, and others. Scaling language models: Methods, analysis & insights from
training gopher. arXiv preprint arXiv:2112.11446, 2021.
Shauli Ravfogel, Yoav Goldberg, and Francis Tyers. Can LSTM learn to capture agreement? the case of Basque. In
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP,
pages 98–107, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/
W18-5412. URL https://aclanthology.org/W18-5412.
Shauli Ravfogel, Yoav Goldberg, and Tal Linzen. Studying the Inductive Biases of RNNs with Synthetic Variations of
Natural Languages. In Proceedings of NAACL-HLT, pages 3532–3542, 2019.
Shauli Ravfogel, Grusha Prasad, Tal Linzen, and Yoav Goldberg. Counterfactual interventions reveal the causal effect
of relative clause representations on agreement prediction. In Proceedings of the 25th Conference on Computational
Natural Language Learning, pages 194–209, Online, November 2021. Association for Computational Linguistics.
doi: 10.18653/v1/2021.conll-1.15. URL https://aclanthology.org/2021.conll-1.15.
Abhilasha Ravichander, Eduard Hovy, Kaheer Suleman, Adam Trischler, and Jackie Chi Kit Cheung. On the Systematic-
ity of Probing Contextualized Word Representations: The Case of Hypernymy in BERT. In Proceedings of the Ninth
Joint Conference on Lexical and Computational Semantics, pages 88–102, Barcelona, Spain (Online), December
2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.starsem-1.10.
Keith Rayner and Lyn Frazier. Selection mechanisms in reading lexically ambiguous words. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 15(5):779, 1989. Publisher: American Psychological Association.
Keith Rayner, Anne E. Cook, Barbara J. Juhasz, and Lyn Frazier. Immediate disambiguation of lexically ambiguous
words during reading: Evidence from eye movements. British Journal of Psychology, 97(4):467–482, 2006. Publisher:
Wiley Online Library.
Gabriel Recchia. Teaching autoregressive language models complex tasks by demonstration. arXiv preprint
arXiv:2109.02102, 2021.
Aniketh Janardhan Reddy and Leila Wehbe. Can fMRI reveal the representation of syntactic structure in the brain?
In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural
Information Processing Systems, volume 34, pages 9843–9856. Curran Associates, Inc., 2021. URL https:
//proceedings.neurips.cc/paper/2021/file/51a472c08e21aef54ed749806e3e6490-Paper.pdf.
Mor Regev, Christopher J. Honey, Erez Simony, and Uri Hasson. Selective and Invariant Neural Responses to Spoken
and Written Narratives. Journal of Neuroscience, 33(40):15978–15988, October 2013. ISSN 0270-6474, 1529-2401.
doi: 10.1523/JNEUROSCI.1580-13.2013. URL https://www.jneurosci.org/content/33/40/15978.
Tamar I. Regev, Josef Affourtit, Xuanyi Chen, Abigail E. Schipper, Leon Bergen, Kyle Mahowald, and Evelina
Fedorenko. High-level language brain regions are sensitive to sub-lexical regularities, June 2021. URL https://
www.biorxiv.org/content/10.1101/2021.06.11.447786v1.
Brett D Roads and Bradley C Love. Learning as the unsupervised alignment of conceptual systems. Nature Machine
Intelligence, 2(1):76–82, 2020.
Alberto Romero. GPT-3 Scared You? Meet Wu Dao 2.0: A Monster of
1.75 Trillion Parameters, June 2022. URL https://towardsdatascience.com/
gpt-3-scared-you-meet-wu-dao-2-0-a-monster-of-1-75-trillion-parameters-832cd83db484.
Eric Ronco and Peter J. Gawthrop. Neural networks for modelling and control. Technical Report csc97008, 1997.
D. E. Rumelhart and J. L. McClelland. Parallel distributed processing. MIT Press, Cambridge, MA, 1986.
D. E. Rumelhart and J. L. McClelland. Learning the past tenses of English verbs: Implicit rules or parallel distributed
processing. Mechanisms of language acquisition, pages 195–248, 1987.
Eleanor M. Saffran. Aphasia and the Relationship of Language and Brain. Seminars in Neurology, 20(4):409–418, 2000.
ISSN 0271-8235, 1098-9021. doi: 10.1055/s-2000-13173. URL http://www.thieme-connect.de/DOI/DOI?10.
1055/s-2000-13173.
J.R. Saffran and E.D. Thiessen. Pattern induction by infant language learners. Developmental psychology, 39(3):484,
2003.
J.R. Saffran, R.N. Aslin, and E.L. Newport. Statistical learning by 8-month-old infants. Science, 274(5294):1926, 1996.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An Adversarial Winograd
Schema Challenge at Scale. arXiv:1907.10641 [cs], November 2019. URL http://arxiv.org/abs/1907.10641.
Maarten Sap, Ronan LeBras, Daniel Fried, and Yejin Choi. Neural Theory-of-Mind? On the Limits of Social Intelligence
in Large LMs, October 2022. URL http://arxiv.org/abs/2210.13312. arXiv:2210.13312 [cs].
R. Saxe and N. Kanwisher. People thinking about thinking people. The role of the temporo-parietal junction in "theory
of mind". NeuroImage, 19(4):1835–1842, August 2003. ISSN 1053-8119. doi: 10.1016/s1053-8119(03)00230-1.
Rebecca Saxe. Uniquely human social cognition. Current Opinion in Neurobiology, 16(2):235–239, April 2006. ISSN
0959-4388. doi: 10.1016/j.conb.2006.03.001. URL https://www.sciencedirect.com/science/article/pii/
S0959438806000262.
Rebecca Saxe and Lindsey J. Powell. It’s the thought that counts: specific brain regions for one component of theory
of mind. Psychological Science, 17(8):692–699, August 2006. ISSN 0956-7976. doi: 10.1111/j.1467-9280.2006.
01768.x.
Rebecca Saxe, Joseph M. Moran, Jonathan Scholz, and John Gabrieli. Overlapping and non-overlapping brain regions for
theory of mind and self reflection in individual subjects. Social Cognitive and Affective Neuroscience, 1(3):229–234,
December 2006. ISSN 1749-5016. doi: 10.1093/scan/nsl034. URL https://doi.org/10.1093/scan/nsl034.
Martin Schrimpf, Idan Asher Blank, Greta Tuckute, Carina Kauf, Eghbal A. Hosseini, Nancy Kanwisher, Joshua B.
Tenenbaum, and Evelina Fedorenko. The neural architecture of language: Integrative modeling converges on
predictive processing. Proceedings of the National Academy of Sciences, 118(45), November 2021. ISSN 0027-
8424, 1091-6490. doi: 10.1073/pnas.2105646118. URL https://www.pnas.org/content/118/45/e2105646118.
Sebastian Schuster and Tal Linzen. When a sentence does not introduce a discourse entity, transformer-based
models still sometimes refer to it. In Proceedings of the 2022 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pages 969–982, Seattle, United
States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.71. URL https:
//aclanthology.org/2022.naacl-main.71.
Sarah Schwettmann, Joshua B Tenenbaum, and Nancy Kanwisher. Invariant representations of mass in the human brain.
eLife, 8:e46619, December 2019. ISSN 2050-084X. doi: 10.7554/eLife.46619. URL https://doi.org/10.7554/
eLife.46619. Publisher: eLife Sciences Publications, Ltd.
Terri L. Scott, Jeanne Gallée, and Evelina Fedorenko. A new fun and robust version of an fMRI localizer for the
frontotemporal language system. Cognitive Neuroscience, 8(3):167–176, 2017. ISSN 1758-8936. doi: 10.1080/
17588928.2016.1201466.
Mark S. Seidenberg and Maryellen C. MacDonald. A probabilistic constraints approach to language acquisition and
processing. Cognitive science, 23(4):569–588, 1999. Publisher: Wiley Online Library.
Mark S. Seidenberg, Michael K. Tanenhaus, James M. Leiman, and Marie Bienkowski. Automatic access of the
meanings of ambiguous words in context: Some limitations of knowledge-based processing. Cognitive psychology,
14(4):489–537, 1982. Publisher: Elsevier.
Cory Shain, Idan Asher Blank, Marten van Schijndel, William Schuler, and Evelina Fedorenko. fMRI reveals language-
specific predictive coding during naturalistic sentence comprehension. Neuropsychologia, 138:107307, 2020. ISSN
1873-3514. doi: 10.1016/j.neuropsychologia.2019.107307.
Cory Shain, Hope Kean, Benjamin Lipkin, Josef Affourtit, Matthew Siegelman, Francis Mollica, and Evelina Fedorenko.
‘Constituent length’ effects in fMRI do not provide evidence for abstract syntactic processing, November 2021.
URL https://www.biorxiv.org/content/10.1101/2021.11.12.467812v1.
Cory Shain, Idan A. Blank, Evelina Fedorenko, Edward Gibson, and William Schuler. Robust Effects of Working Mem-
ory Demand during Naturalistic Language Comprehension in Language-Selective Cortex. Journal of Neuroscience,
42(39):7412–7430, September 2022a. ISSN 0270-6474, 1529-2401. doi: 10.1523/JNEUROSCI.1894-21.2022. URL
https://www.jneurosci.org/content/42/39/7412.
Cory Shain, Alexander M. Paunov, Xuanyi Chen, Benjamin Lipkin, and Evelina Fedorenko. No evidence of theory
of mind reasoning in the human language network, July 2022b. URL https://www.biorxiv.org/content/10.
1101/2022.07.18.500516v1.
Murray Shanahan. Talking About Large Language Models, December 2022. URL http://arxiv.org/abs/2212.
03551. arXiv:2212.03551 [cs].
Lauren J. Silbert, Christopher J. Honey, Erez Simony, David Poeppel, and Uri Hasson. Coupled neural systems
underlie the production and comprehension of naturalistic narrative speech. Proceedings of the National Academy of
Sciences, 111(43):E4687–E4696, October 2014. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1323812111. URL
https://www.pnas.org/content/111/43/E4687.
Erez Simony, Christopher J. Honey, Janice Chen, Olga Lositsky, Yaara Yeshurun, Ami Wiesel, and Uri Hasson. Dynamic
reconfiguration of the default mode network during narrative comprehension. Nature Communications, 7(1):1–13, July
2016. ISSN 2041-1723. doi: 10.1038/ncomms12141. URL https://www.nature.com/articles/ncomms12141.
Dan I. Slobin. From “thought and language” to “thinking for speaking”. In John J. Gumperz and Stephen C. Levinson, editors, Rethinking Linguistic Relativity, pages 70–96. Cambridge University Press, 1996.
Tineke M. Snijders, Theo Vosse, Gerard Kempen, Jos J. A. Van Berkum, Karl Magnus Petersson, and Peter Hagoort.
Retrieval and unification of syntactic structure in sentence comprehension: an FMRI study using word-category
ambiguity. Cerebral Cortex (New York, N.Y.: 1991), 19(7):1493–1503, July 2009. ISSN 1460-2199. doi: 10.1093/
cercor/bhn187.
Stephanie Solt. Two types of modified cardinals. In International Conference on Adjectives. Lille, 2007.
Ben Sorscher, Surya Ganguli, and Haim Sompolinsky. The geometry of concept learning. bioRxiv, 2021.
E. Spelke. Core Knowledge. In N. Kanwisher and J. Duncan, editors, Attention and performance, vol. 20: Functional neuroimaging of visual cognition. Oxford University Press, Oxford, 2004.
Nicola Spotorno, Eric Koun, Jérôme Prado, Jean-Baptiste Van Der Henst, and Ira A. Noveck. Neural evidence
that utterance-processing entails mentalizing: the case of irony. NeuroImage, 63(1):25–39, October 2012. ISSN
1095-9572. doi: 10.1016/j.neuroimage.2012.06.046.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R
Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and
extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. oLMpics-on what language model pre-training
captures. Transactions of the Association for Computational Linguistics, 8:743–758, 2020a. doi: 10.1162/tacl_a_
00342. URL https://aclanthology.org/2020.tacl-1.48.
Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. Leap-of-thought: Teaching pre-trained
models to systematically reason over implicit knowledge. Advances in Neural Information Processing Systems, 33:
20227–20237, 2020b.
M. K. Tanenhaus, M. J. Spivey-Knowlton, K. M. Eberhard, and J. C. Sedivy. Integration of visual and linguistic
information in spoken language comprehension. Science, 268(5217):1632, 1995.
Leyla Tarhan and Talia Konkle. Sociality and interaction envelope organize visual action representations. Nature
Communications, 11(1):3002, June 2020. ISSN 2041-1723. doi: 10.1038/s41467-020-16846-w. URL https:
//www.nature.com/articles/s41467-020-16846-w.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the
57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July
2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452. URL https://aclanthology.org/
P19-1452.
Michael Tomasello. Origins of human communication. MIT Press, 2010.
John C. Trueswell and Michael K. Tanenhaus. Toward a lexicalist framework of constraint-based syntactic ambiguity
resolution. Lawrence Erlbaum Associates, Inc, 1994.
John C. Trueswell, Michael K. Tanenhaus, and Christopher Kello. Verb-specific constraints in sentence processing:
separating effects of lexical preference from garden-paths. Journal of Experimental psychology: Learning, memory,
and Cognition, 19(3):528, 1993. Publisher: American Psychological Association.
Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn, Ziqin Rong, Olga Kononova, Kristin A Persson,
Gerbrand Ceder, and Anubhav Jain. Unsupervised word embeddings capture latent knowledge from materials science
literature. Nature, 571(7763):95–98, 2019.
Mycal Tucker, Tiwalayo Eisape, Peng Qian, Roger Levy, and Julie Shah. When does syntax mediate neural language
model performance? evidence from dropout probes. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5393–5408, Seattle,
United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.394. URL
https://aclanthology.org/2022.naacl-main.394.
Alan M. Turing. Computing Machinery and Intelligence. Mind, 59(October):433–60, 1950. doi: 10.1093/mind/LIX.
236.433. Publisher: Oxford University Press.
Layla Unger and Anna V Fisher. The emergence of richly organized semantic knowledge from simple statistics: A
synthetic review. Developmental Review, 60:100949, 2021.
Akira Utsumi. Exploring what is encoded in distributional word vectors: A neurobiologically motivated analysis.
Cognitive Science, 44(6):e12844, 2020.
Markus J. van Ackeren, Daniel Casasanto, Harold Bekkering, Peter Hagoort, and Shirley-Ann Rueschemeyer. Pragmat-
ics in action: indirect requests engage theory of mind areas and the cortical motor network. Journal of Cognitive
Neuroscience, 24(11):2237–2247, November 2012. ISSN 1530-8898. doi: 10.1162/jocn_a_00274.
Teun Adrianus van Dijk and Walter Kintsch. Strategies of discourse comprehension. Academic Press, 1983.
Jeroen van Paridon, Qiawen Liu, and Gary Lupyan. How do blind people know that blue is cold? distributional
semantics encode color-adjective associations. In Proceedings of the Annual Meeting of the Cognitive Science Society,
volume 43, 2021.
Marten Van Schijndel and Tal Linzen. Single-stage prediction models do not explain the magnitude of syntactic
disambiguation difficulty. Cognitive science, 45(6):e12988, 2021.
Marten van Schijndel, Aaron Mueller, and Tal Linzen. Quantity doesn’t buy quality syntax with neural language
models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the
9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5831–5837, Hong
Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1592. URL
https://aclanthology.org/D19-1592.
Rosemary A. Varley. Aphasic language, aphasic thought: an investigation of propositional thinking in an a-propositional
aphasic. In Peter Carruthers and Jill Boucher, editors, Language and Thought: Interdisciplinary Themes,
pages 128–145. Cambridge University Press, 1998. doi: 10.1017/CBO9780511597909.009.
Rosemary A. Varley and M. Siegal. Evidence for cognition without grammar from causal reasoning and ’theory of
mind’ in an agrammatic aphasic patient. Current biology: CB, 10(12):723–726, June 2000. ISSN 0960-9822. doi:
10.1016/s0960-9822(00)00538-8.
Rosemary A. Varley, M. Siegal, and S. C. Want. Severe impairment in grammar does not preclude theory of mind.
Neurocase, 7(6):489–493, 2001. ISSN 1355-4794. doi: 10.1093/neucas/7.6.489.
Rosemary A. Varley, Nicolai J. C. Klessinger, Charles A. J. Romanowski, and Michael Siegal. Agrammatic but
numerate. Proceedings of the National Academy of Sciences of the United States of America, 102(9):3519–3524,
March 2005. ISSN 0027-8424. doi: 10.1073/pnas.0407470102.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates,
Inc., 2017a. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin. Attention Is All You Need. arXiv:1706.03762 [cs], December 2017b. URL http://arxiv.org/abs/
1706.03762.
Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. In Proceedings
of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76,
Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-4808. URL
https://aclanthology.org/W19-4808.
Elena Voita and Ivan Titov. Information-theoretic probing with minimum description length. In Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196, Online,
November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.14. URL https:
//aclanthology.org/2020.emnlp-main.14.
Jon Walbrin, Paul Downing, and Kami Koldewyn. Neural responses to visually observed social interactions. Neu-
ropsychologia, 112:31–39, April 2018. ISSN 0028-3932. doi: 10.1016/j.neuropsychologia.2018.02.023. URL
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5899757/.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking
and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162,
Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL
https://aclanthology.org/D19-1221.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A Multi-Task
Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP
Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018. URL
https://www.aclweb.org/anthology/W18-5446.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and
Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems.
In 33rd Conference on Neural Information Processing Systems, 2019a. URL https://proceedings.neurips.
cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf.
Yining Wang, Long Zhou, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing Zong. A compact and language-
sensitive multilingual translation method. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, pages 1213–1223, Florence, Italy, July 2019b. Association for Computational Linguistics.
doi: 10.18653/v1/P19-1117. URL https://aclanthology.org/P19-1117.
Zihan Wang, Karthikeyan K, Stephen Mayhew, and Dan Roth. Extending multilingual BERT to low-resource languages.
In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2649–2656, Online, November
2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.240. URL https://
aclanthology.org/2020.findings-emnlp.240.
Alex Warstadt and Samuel R Bowman. What artificial neural networks can tell us about human language acquisition.
arXiv preprint arXiv:2208.07998, 2022.
Alex Warstadt, Yu Cao, Ioana Grosu, Wei Peng, Hagen Blix, Yining Nie, Anna Alsop, Shikha Bordia, Haokun Liu,
Alicia Parrish, Sheng-Fu Wang, Jason Phang, Anhad Mohananey, Phu Mon Htut, Paloma Jeretic, and Samuel R.
Bowman. Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs. In Proceedings of
EMNLP-IJCNLP, pages 2870–2880, 2019. URL https://www.aclweb.org/anthology/D19-1286.
Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman.
BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational
Linguistics, 8:377–392, 2020. doi: 10.1162/tacl_a_00321. URL https://doi.org/10.1162/tacl_a_00321.
Jason Wei, Dan Garrette, Tal Linzen, and Ellie Pavlick. Frequency effects on syntactic rule learning in transformers.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 932–948,
Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi:
10.18653/v1/2021.emnlp-main.72. URL https://aclanthology.org/2021.emnlp-main.72.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma,
Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William
Fedus. Emergent Abilities of Large Language Models, June 2022a. URL http://arxiv.org/abs/2206.07682.
arXiv:2206.07682 [cs].
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought
prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022b.
Carl Wernicke. Der aphasische Symptomencomplex: eine psychologische Studie auf anatomischer Basis. Cohn, 1874.
Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. What do RNN language models learn about filler–gap
dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural
Networks for NLP, pages 211–221, Brussels, Belgium, November 2018. Association for Computational Linguistics.
doi: 10.18653/v1/W18-5423. URL https://aclanthology.org/W18-5423.
Ethan Wilcox, Peng Qian, Richard Futrell, Miguel Ballesteros, and Roger Levy. Structural Supervision Improves
Learning of Non-Local Grammatical Dependencies. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3302–3312, 2019.
Ethan Wilcox, Pranali Vani, and Roger Levy. A Targeted Assessment of Incremental Processing in Neural Language
Models and Humans. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages
939–952, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.76. URL
https://aclanthology.org/2021.acl-long.76.
Roel M. Willems, Yael Benn, Peter Hagoort, Ivan Toni, and Rosemary A. Varley. Communicating without a
functioning language system: Implications for the role of language in mentalizing. Neuropsychologia, 49
(11):3130–3135, September 2011. ISSN 0028-3932. doi: 10.1016/j.neuropsychologia.2011.07.023. URL
https://www.sciencedirect.com/science/article/pii/S0028393211003472.
Deirdre Wilson and Dan Sperber. Relevance theory. In Handbook of pragmatics. Blackwell, 2002.
Stephen M. Wilson, Istvan Molnar-Szakacs, and Marco Iacoboni. Beyond Superior Temporal Cortex: Intersubject
Correlations in Narrative Speech Comprehension. Cerebral Cortex, 18(1):230–242, January 2008. ISSN 1047-3211.
doi: 10.1093/cercor/bhm049. URL https://doi.org/10.1093/cercor/bhm049.
Stephen M. Wilson, Dana K. Eriksson, Melodie Yen, Andrew T. Demarco, Sarah M. Schneck, and Jillian M. Lucanie.
Language Mapping in Aphasia. Journal of Speech, Language, and Hearing Research : JSLHR, 62(11):3937–3946,
November 2019. ISSN 1092-4388. doi: 10.1044/2019_JSLHR-L-RSNP-19-0031. URL https://www.ncbi.nlm.
nih.gov/pmc/articles/PMC7203526/.
Ludwig Wittgenstein. Philosophical investigations. John Wiley & Sons, 1953.
Alexandra Woolgar, Alice Parr, Rhodri Cusack, Russell Thompson, Ian Nimmo-Smith, Teresa Torralva, Maria Roca,
Nagui Antoun, Facundo Manes, and John Duncan. Fluid intelligence loss linked to restricted regions of damage
within frontal and parietal cortex. Proceedings of the National Academy of Sciences, 107(33):14899–14902, August
2010. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1007928107. URL https://www.pnas.org/content/107/
33/14899.
Alexandra Woolgar, John Duncan, F. Manes, and E. Fedorenko. Fluid intelligence is supported by the multiple-demand
system not the language system. Nature Human Behavior, March 2018. ISSN 2397-3374. doi: 10.17863/CAM.22222.
URL https://www.repository.cam.ac.uk/handle/1810/275048.
Zhengxuan Wu, Atticus Geiger, Joshua Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, and Noah
Goodman. Causal distillation for language models. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4288–4295, Seattle,
United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.318. URL
https://aclanthology.org/2022.naacl-main.318.
King-Yin Yan. AGI via combining logic with deep learning. In International Conference on Artificial General Intelligence, pages 327–343. Springer, 2021.
Guangyu Robert Yang, Madhura R. Joglekar, H. Francis Song, William T. Newsome, and Xiao-Jing Wang. Task
representations in neural networks trained to perform many cognitive tasks. Nature Neuroscience, 22(2):297–306,
February 2019. ISSN 1546-1726. doi: 10.1038/s41593-018-0310-2. URL https://www.nature.com/articles/
s41593-018-0310-2.
Yaara Yeshurun, Mai Nguyen, and Uri Hasson. Amplification of local changes along the timescale processing hierarchy.
Proceedings of the National Academy of Sciences, 114(35):9475–9480, 2017.
Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. Neural-symbolic vqa:
Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing
Systems, pages 1031–1042, 2018.
Charles Yu, Ryan Sie, Nicolas Tedeschi, and Leon Bergen. Word frequency does not predict grammatical knowledge in
language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 4040–4054, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/
2020.emnlp-main.331. URL https://aclanthology.org/2020.emnlp-main.331.
Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. SWAG: A large-scale adversarial dataset for grounded com-
monsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
pages 93–104. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/D18-1009.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a Machine Really Finish
Your Sentence? arXiv preprint arXiv:1905.07830, 2019.
Honghua Zhang, Liunian Harold Li, Tao Meng, Kai-Wei Chang, and Guy Van den Broeck. On the Paradox of Learning
to Reason from Data, May 2022. URL http://arxiv.org/abs/2205.11502. arXiv:2205.11502 [cs].
Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel R. Bowman. When do you need billions of words of
pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages
1112–1125, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.90.
URL https://aclanthology.org/2021.acl-long.90.
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, and
James Laudon. Mixture-of-Experts with Expert Choice Routing, October 2022. URL http://arxiv.org/abs/
2202.09368. arXiv:2202.09368 [cs].
Chengxu Zhuang, Siming Yan, Aran Nayebi, Martin Schrimpf, Michael C. Frank, James J. DiCarlo, and Daniel L. K.
Yamins. Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of
Sciences, 118(3):e2014196118, January 2021. doi: 10.1073/pnas.2014196118. URL https://www.pnas.org/doi/
10.1073/pnas.2014196118. Publisher: Proceedings of the National Academy of Sciences.