
Talking About Large Language Models

Murray Shanahan
Imperial College London
m.shanahan@imperial.ac.uk

December 2022
Revised February 2023

arXiv:2212.03551v5 [cs.CL] 16 Feb 2023

Abstract

Thanks to rapid progress in artificial intelligence, we have entered an era when technology and philosophy intersect in interesting ways. Sitting squarely at the centre of this intersection are large language models (LLMs). The more adept LLMs become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are. This trend is amplified by the natural tendency to use philosophically loaded terms, such as “knows”, “believes”, and “thinks”, when describing these systems. To mitigate this trend, this paper advocates the practice of repeatedly stepping back to remind ourselves of how LLMs, and the systems of which they form a part, actually work. The hope is that increased scientific precision will encourage more philosophical nuance in the discourse around artificial intelligence, both within the field and in the public sphere.

1 Introduction

The advent of large language models (LLMs) such as Bert (Devlin et al., 2018) and GPT-2 (Radford et al., 2019) was a game-changer for artificial intelligence. Based on transformer architectures (Vaswani et al., 2017), comprising hundreds of billions of parameters, and trained on hundreds of terabytes of textual data, their contemporary successors such as GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), and PaLM (Chowdhery et al., 2022) have given new meaning to the phrase “unreasonable effectiveness of data” (Halevy et al., 2009).

The effectiveness of these models is “unreasonable” (or, with the benefit of hindsight, somewhat surprising) in three inter-related ways. First, the performance of LLMs on benchmarks scales with the size of the training set (and, to a lesser degree, with model size). Second, there are qualitative leaps in capability as the models scale. Third, a great many tasks that demand intelligence in humans can be reduced to next token prediction with a sufficiently performant model. It is the last of these three surprises that is the focus of the present paper.

As we build systems whose capabilities more and more resemble those of humans, despite the fact that those systems work in ways that are fundamentally different from the way humans work, it becomes increasingly tempting to anthropomorphise them. Humans have evolved to co-exist over many millions of years, and human culture has evolved over thousands of years to facilitate this co-existence, which ensures a degree of mutual understanding. But it is a serious mistake to unreflectingly apply to AI systems the same intuitions that we deploy in our dealings with each other, especially when those systems are so profoundly different from humans in their underlying operation.

The AI systems we are building today have considerable utility and enormous commercial potential, which imposes on us a great responsibility. To ensure that we can make informed decisions about the trustworthiness and safety of the AI systems we deploy, it is advisable to keep to the fore the way those systems actually work, and thereby to avoid imputing to them capacities they lack, while making the best use of the remarkable capabilities they genuinely possess.

2 What LLMs Really Do

As Wittgenstein reminds us, human language use is an aspect of human collective behaviour, and it only makes sense in the wider context of the human social activity of which it forms a part (Wittgenstein, 1953). A human infant is born into a community of language users with which it shares a world, and it acquires language by interacting with this community and with the world it shares with them. As adults (or indeed as children past a certain age), when we have a casual conversation, we are engaging in an activity that is built upon this foundation. The same is true when we make a speech or send an email or deliver a lecture or write a paper. All of this language-involving activity makes sense because we inhabit a world we share with other language users.

A large language model is a very different sort of animal (Bender and Koller, 2020; Bender et al., 2021; Marcus and Davis, 2020). (Indeed, it is not an animal at all, which is very much to the point.) LLMs are generative mathematical models of the statistical distribution of tokens in the vast public corpus of human-generated text, where the tokens in question include words, parts of words, or individual characters including punctuation marks. They are generative because we can sample from them, which means we can ask them questions. But the questions are of the following very specific kind. “Here’s a fragment of text. Tell me how this fragment might go on. According to your model of the statistics of human language, what words are likely to come next?” [1]

[1] The point holds even if an LLM is fine-tuned, for example using reinforcement learning with human feedback (RLHF). See Section 12.

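To make this concrete, here is a minimal sketch of how such a question can be put to a model in practice. The choice of GPT-2 and the Hugging Face transformers library, like the prompt itself, is an illustrative assumption and not part of the paper's argument.

    # Minimal sketch: asking a language model how a fragment of text might go on.
    # GPT-2 and the Hugging Face "transformers" library are illustrative choices only.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The capital of France is"  # an arbitrary fragment of text
    inputs = tokenizer(prompt, return_tensors="pt")

    # Sample a continuation from the model's distribution over token sequences.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Nothing in this exchange involves the model consulting the world; it simply returns a statistically likely continuation of the fragment it was given.
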
Recently, it has become commonplace to use the term “large language model” both for the generative models themselves, and for the systems in which they are embedded, especially in the context of conversational agents or AI assistants such as ChatGPT. But for philosophical clarity, it’s crucial to keep the distinction between these things to the fore. The bare-bones LLM itself, the core component of an AI assistant, has a highly specific, well-defined function, which can be described in precise mathematical and engineering terms. It is in this sense that we can speak of what an LLM “really” does.

Suppose we give an LLM the prompt “The first person to walk on the Moon was ”, and suppose it responds with “Neil Armstrong”. What are we really asking here? In an important sense, we are not really asking who was the first person to walk on the Moon. What we are really asking the model is the following question: Given the statistical distribution of words in the vast public corpus of (English) text, what words are most likely to follow the sequence “The first person to walk on the Moon was ”? A good reply to this question is “Neil Armstrong”.

Similarly, we might give an LLM the prompt “Twinkle twinkle ”, to which it will most likely respond “little star”. On one level, for sure, we are asking the model to remind us of the lyrics of a well-known nursery rhyme. But in an important sense what we are really doing is asking it the following question: Given the statistical distribution of words in the public corpus, what words are most likely to follow the sequence “Twinkle twinkle ”? To which an accurate answer is “little star”.

Here’s a third example. Suppose you are the developer of an LLM and you prompt it with the words “After the ring was destroyed, Frodo Baggins returned to ”, to which it responds “the Shire”. What are you doing here? On one level, it seems fair to say, you might be testing the model’s knowledge of the fictional world of Tolkien’s novels. But, in an important sense, the question you are really asking (as you presumably know, because you are the developer) is this: Given the statistical distribution of words in the public corpus, what words are most likely to follow the sequence “After the ring was destroyed, Frodo Baggins returned to ”? To which an appropriate response is “the Shire”.

To the human user, each of these examples presents a different sort of relationship to truth. In the case of Neil Armstrong, the ultimate grounds for the truth or otherwise of the LLM’s answer is the real world. The Moon is a real object and Neil Armstrong was a real person, and his walking on the Moon is a fact about the physical world. Frodo Baggins, on the other hand, is a fictional character, and the Shire is a fictional place. Frodo’s return to the Shire is a fact about an imaginary world, not a real one. As for the little star in the nursery rhyme, well that is barely even a fictional object, and the only fact at issue is the occurrence of the words “little star” in a familiar English rhyme.

These distinctions are invisible at the level of what the LLM itself — the core component of any LLM-based system — actually does, which is simply to generate statistically likely sequences of words. However, when we evaluate the utility of the model, these distinctions matter a great deal. There is no point in seeking Frodo’s (fictional) descendants in the (real) English county of Surrey. This is one reason why it’s a good idea for users to repeatedly remind themselves of what LLMs really do. It’s also a good idea for developers to remind themselves of this, to avoid the misleading use of philosophically fraught words to describe the capabilities of LLMs, words such as “belief”, “knowledge”, “understanding”, “self”, or even “consciousness”.

3 LLMs and the Intentional Stance

It is perfectly natural to use anthropomorphic language in everyday conversations about artefacts, especially in the context of information technology. We do it all the time. My watch doesn’t realise we’re on daylight saving time. My phone thinks we’re in the car park. The mail server won’t talk to the network. And so on. These examples of what Dennett calls the intentional stance are harmless and useful forms of shorthand for complex processes whose details we don’t know or care about. [2] They are harmless because no-one takes them seriously enough to ask their watch to get it right next time, say, or to tell the mail server to try harder. Even without having read Dennett, everyone understands they are taking the intentional stance, that these are just useful turns of phrase.

[2] “The intentional stance is the strategy of interpreting the behavior of an entity ... by treating it as if it were a rational agent” (Dennett, 2009).

The same consideration applies to LLMs, both for users and for developers. Insofar as everyone implicitly understands that these turns of phrase are just convenient shorthands, that they are taking the intentional stance, it does no harm to use them. However, in the case of LLMs, such is their power, things can get a little blurry. When an LLM can be made to improve its performance on reasoning tasks simply by being told to “think step by step” (Kojima et al., 2022) (to pick just one remarkable discovery), the temptation to see it as having human-like characteristics is almost overwhelming.

To be clear, it is not the argument of this paper that a system based on a large language model could never, in principle, warrant description in terms of beliefs, intentions, reason, etc. Nor does the paper advocate any particular account of belief, of intention, or of any other philosophically contentious concept. [3] Rather, the point is that such systems are simultaneously so very different from humans in their construction, yet (often but not always) so human-like in their behaviour, that we need to pay careful attention to how they work before we speak of them in language suggestive of human capabilities and patterns of behaviour.

[3] In particular, when I use the term “really”, as in the question ‘Does X “really” have Y?’, I am not assuming there is some metaphysical fact of the matter here. Rather, the question is whether, when more is revealed about the nature of X, we still want to use the word Y.

To sharpen the issue, let’s compare two very short conversations, one between Alice and Bob (both human), and a second between Alice and BOT, a fictional question-answering system based on a large language model. Suppose Alice asks Bob “What country is to the south of Rwanda?” and Bob replies “I think it’s Burundi”. Shortly afterwards, because Bob is often wrong in such matters, Alice presents the same question to BOT, which (to her mild disappointment) offers the same answer: “Burundi is to the south of Rwanda”. Alice might now reasonably remark that both Bob and BOT knew that Burundi was south of Rwanda. But what is really going on here? Is the word “know” being used in the same sense in the two cases?

4 Humans and LLMs Compared

What is Bob, a representative human, doing when he correctly answers a straightforward factual question in an everyday conversation? To begin with, Bob understands that the question comes from another person (Alice), that his answer will be heard by that person, and that it will have an effect on what she believes. In fact, after many years together, Bob knows a good deal else about Alice that is relevant to such situations: her background knowledge, her interests, her opinion of him, and so on. All of this frames the communicative intent behind his reply, which is to impart a certain fact to her, given his understanding of what she wants to know.

Moreover, when Bob announces that Burundi is to the south of Rwanda, he is doing so against the backdrop of various human capacities that we all take for granted when we engage in everyday commerce with each other. There is a whole battery of techniques we can call upon to ascertain whether a sentence expresses a true proposition, depending on what sort of sentence it is. We can investigate the world directly, with our own eyes and ears. We can consult Google or Wikipedia, or even a book. We can ask someone who is knowledgeable on the relevant subject matter. We can try to think things through, rationally, by ourselves, but we can also argue things out with our peers. All of this relies on there being agreed criteria external to ourselves against which what we say can be assessed.

How about BOT? What is going on when a large language model is used to answer such questions? First, it’s worth noting that a bare-bones LLM is, by itself, not a conversational agent. [4] For a start, the LLM will have to be embedded in a larger system to manage the turn-taking in the dialogue. But it will also need to be coaxed into producing conversation-like behaviour. [5] Recall that an LLM simply generates sequences of words that are statistically likely follow-ons from a given prompt. But the sequence “What country is to the south of Rwanda? Burundi is to the south of Rwanda”, with both sentences squashed together exactly like that, may not, in fact, be very likely. A more likely pattern, given that numerous plays and film scripts feature in the public corpus, would be something like the following.

    Fred: What country is south of Rwanda?
    Jane: Burundi is south of Rwanda.

[4] Strictly speaking, the large language model itself comprises just the model architecture and the trained parameters.

[5] See Thoppilan et al. (2022) for an example of such a system, as well as a useful survey of related dialogue work.

Of course, those exact words may not appear, but their likelihood, in the statistical sense, will be high. In short, BOT will be much better at generating appropriate responses if they conform to this pattern rather than to the pattern of actual human conversation. Fortunately, the user (Alice) doesn’t have to know anything about this. In the background, the LLM is invisibly prompted with a prefix along the following lines.

    This is a conversation between User, a human, and BOT, a clever and knowledgeable AI agent.
    User: What is 2+2?
    BOT: The answer is 4.
    User: Where was Albert Einstein born?
    BOT: He was born in Germany.

Alice’s query, in the following form, is appended to this prefix.

    User: What country is south of Rwanda?
    BOT:

This yields the full prompt to be submitted to the LLM, which will hopefully predict a continuation along the lines we are looking for, i.e. “Burundi is south of Rwanda”.

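As a rough sketch, not the implementation of any particular system, the invisible prompt assembly described above might look like the following, where complete() stands in for a call to the underlying LLM and is an assumed helper rather than a real API.

    # Sketch of the hidden prompt assembly in a BOT-like dialogue system.
    # complete() stands in for a call to the underlying LLM; it is assumed here.

    PREFIX = (
        "This is a conversation between User, a human, "
        "and BOT, a clever and knowledgeable AI agent.\n"
        "User: What is 2+2?\n"
        "BOT: The answer is 4.\n"
        "User: Where was Albert Einstein born?\n"
        "BOT: He was born in Germany.\n"
    )

    def answer(user_query: str, complete) -> str:
        # Append the user's query in the same dialogue format, ending with "BOT:".
        prompt = PREFIX + f"User: {user_query}\nBOT:"
        # The LLM is only asked for a statistically likely continuation of the prompt.
        continuation = complete(prompt)
        # Keep BOT's turn only: cut the continuation at the next "User:" line.
        return continuation.split("User:")[0].strip()

    # With a suitable model behind complete(), we would hope that
    # answer("What country is south of Rwanda?", complete)
    # returns something like "Burundi is south of Rwanda."

From Alice's point of view none of this machinery is visible; the system simply answers her question.
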
Dialogue is just one application of LLMs that can be facilitated by the judicious use of prompt prefixes. In a similar way, LLMs can be adapted to perform numerous tasks without further training (Brown et al., 2020). This has led to a whole new category of AI research, namely prompt engineering, which will remain relevant until we have better models of the relationship between what we say and what we want.

5 Do LLMs Really Know Anything?

Turning an LLM into a question-answering system by a) embedding it in a larger system, and b) using prompt engineering to elicit the required behaviour exemplifies a pattern found in much contemporary work. In a similar fashion, LLMs can be used not only for question-answering, but also to summarise news articles, to generate screenplays, to solve logic puzzles, and to translate between languages, among other things. There are two important takeaways here. First, the basic function of a large language model, namely to generate statistically likely continuations of word sequences, is extraordinarily versatile. Second, notwithstanding this versatility, at the heart of every such application is a model doing just that one thing: generating statistically likely continuations of word sequences.

With this insight to the fore, let’s revisit the question of how LLMs compare to humans, and reconsider the propriety of the language we use to talk about them. In contrast to humans like Bob and Alice, a simple LLM-based question-answering system, such as BOT, has no communicative intent (Bender and Koller, 2020). In no meaningful sense, even under the licence of the intentional stance, does it know that the questions it is asked come from a person, or that a person is on the receiving end of its answers. By implication, it knows nothing about that person. It has no understanding of what they want to know nor of the effect its response will have on their beliefs.

Moreover, in contrast to its human interlocutors, a simple LLM-based question-answering system like BOT does not properly speaking have beliefs. [6] BOT does not “really” know that Burundi is south of Rwanda, although the intentional stance does, in this case, license Alice’s casual remark to the contrary. To see this, we need to think separately about the underlying LLM and the system in which it is embedded.

[6] This paper focuses on belief, knowledge, and reason. Others have argued about meaning in LLMs (Bender and Koller, 2020; Piantadosi and Hill, 2022). Here we take no particular stand on meaning, instead preferring questions about how words are used, whether they are words generated by the LLMs themselves or words generated by humans that are about LLMs.

First, let’s consider the underlying LLM, that is to say the bare-bones model, comprising the model architecture and the trained parameters. A bare-bones LLM doesn’t “really” know anything because all it does, at a fundamental level, is sequence prediction. Sometimes a predicted sequence takes the form of a proposition. But the special relationship propositional sequences have to truth is apparent only to the humans who are asking questions, or to those who provided the data the model was trained on. Sequences of words with a propositional form are not special to the model itself in the way they are to us. The model itself has no notion of truth or falsehood, properly speaking, because it lacks the means to exercise these concepts in anything like the way we do.

It could perhaps be argued that an LLM “knows” what words typically follow what other words, in a sense that does not rely on the intentional stance. But even if we allow this, knowing that the word “Burundi” is likely to succeed the words “The country to the south of Rwanda is” is not the same as knowing that Burundi is to the south of Rwanda. To confuse those two things is to make a profound category mistake. If you doubt this, consider whether knowing that the word “little” is likely to follow the words “Twinkle, twinkle” is the same as knowing that twinkle twinkle little. The idea doesn’t even make sense.

So much for the bare-bones language model. What about the whole dialogue system of which the LLM is the core component? Does that have beliefs, properly speaking? At least the very idea of the whole system having beliefs makes sense. There is no category error here. However, for a simple dialogue agent like BOT, the answer is surely still “no”. A simple LLM-based question-answering system like BOT lacks the means to use the words “true” and “false” in all the ways, and in all the contexts, that we do. It cannot participate fully in the human language game of truth, because it does not inhabit the world we human language-users share. [7]

[7] For a discussion of the “language game of truth”, see Shanahan (2010), pp.36–39.

6 What About Emergence?

Contemporary large language models are so powerful, so versatile, and so useful that the argument above might be difficult to accept. Exchanges with state-of-the-art LLM-based conversational agents, such as ChatGPT, are so convincing, it is hard not to anthropomorphise them. Could it be that something more complex and subtle is going on here? After all, the overriding lesson of recent progress in LLMs is that extraordinary and unexpected capabilities emerge when big enough models are trained on very large quantities of textual data (Wei et al., 2022a).

One tempting line of argument goes like this. Although large language models, at root, only perform sequence prediction, it’s possible that, in learning to do this, they have discovered emergent mechanisms that warrant a description in higher-level terms. These higher-level terms might include “knowledge” and “belief”. Indeed, we know that artificial neural networks can approximate any computable function to an arbitrary degree of accuracy. So, given enough parameters, data, and computing power, perhaps stochastic gradient descent will discover such mechanisms if they are the best way to optimise the objective of making accurate sequence predictions.

Again, it’s important here to distinguish between the bare-bones model and the whole system. Only in the context of a capacity to distinguish truth from falsehood can we legitimately speak of “belief” in its fullest sense. But an LLM — the bare-bones model — is not in the business of making judgements. It just models what words are likely to follow from what other words. The internal mechanisms it uses to do this, whatever they are, cannot in themselves be sensitive to the truth or otherwise of the word sequences it predicts.

Of course, it is perfectly acceptable to say that an LLM “encodes”, “stores”, or “contains” knowledge, in the same sense that an encyclopedia can be said to encode, store, or contain knowledge. Indeed, it can reasonably be claimed that one emergent property of an LLM is that it encodes kinds of knowledge of the everyday world and the way it works that no encyclopedia captures (Li et al., 2021). But if Alice were to remark that “Wikipedia knew that Burundi was south of Rwanda”, it would be a figure of speech, not a literal statement. An encyclopedia doesn’t literally “know” or “believe” anything, in the way that a human does, and neither does a bare-bones LLM.

The real issue here is that, whatever emergent properties it has, the LLM itself has no access to any external reality against which its words might be measured, nor the means to apply any other external criteria of truth, such as agreement with other language-users. [8] It only makes sense to speak of such criteria in the context of the system as a whole, and for a system as a whole to meet them, it needs to be more than a simple conversational agent. In the words of B. C. Smith, it must “authentically engage with the world’s being the way in which [its] representations represent it as being” (Smith, 2019).

[8] Davidson uses a similar argument to call into question whether belief is possible without language (Davidson, 1982). The point here is different. We are concerned with conditions that have to be met for the generation of a natural language sentence to reflect the possession of a propositional attitude.

7 External Information Sources

The point here does not concern any specific belief. It concerns the prerequisites for ascribing any beliefs at all to a system. Nothing can count as a belief about the world we share — in the largest sense of the term — unless it is against the backdrop of the ability to update beliefs appropriately in the light of evidence from that world, an essential aspect of the capacity to distinguish truth from falsehood.

Could Wikipedia, or some other trustworthy factual website, provide external criteria against which the truth or falsehood of a belief might be measured? [9] Suppose an LLM were embedded in a system that regularly consulted such sources, and used a contemporary model editing technique to maintain the factual accuracy of its predictions (such as the one described by Meng et al. (2022) [10]). Would this not count as exercising the required sort of capacity to update belief in the light of evidence?

[9] Contemporary LLM-based systems that consult external information sources include LaMDA (Thoppilan et al., 2022), Sparrow (Glaese et al., 2022), and Toolformer (Schick et al., 2023). The use of external resources more generally is known as tool-use in the LLM literature, a concept that also encompasses calculators, calendars, and programming language environments.

[10] Commendably, Meng et al. (2022) use the term “factual associations” to denote the information that underlies an LLM’s ability to generate word sequences with a propositional form.

Crucially, this line of thinking depends on the shift from the language model itself to the larger system of which the language model is a part. The language model itself is still just a sequence predictor, and has no more access to the external world than it ever did. It is only with respect to the whole system that the intentional stance becomes more compelling in such a case. But before yielding to it, we should remind ourselves of how very different such systems are from human beings. When Alice took to Wikipedia and confirmed that Burundi was south of Rwanda, what took place was more than just an update to a model in her head of the distribution of word sequences in the English language.

The change that took place in Alice was a reflection of her nature as a language-using animal inhabiting a shared world with a community of other language-users. Humans are the natural home of talk of beliefs and the like, and the behavioural expectations that go hand-in-hand with such talk are grounded in our mutual understanding, which is itself the product of a common evolutionary heritage. When we interact with an AI system based on a large language model, these grounds are absent, an important consideration when deciding whether or not to speak of such a system as if it “really” had beliefs.

8 Vision-Language Models

A sequence predictor may not by itself be the kind of thing that could have communicative intent or form beliefs about an external reality. But, as repeatedly emphasised, LLMs in the wild must be embedded in larger architectures to be useful. To build a question-answering system, the LLM simply has to be supplemented with a dialogue management system that queries the model as appropriate. There is nothing this larger architecture does that might count as communicative intent or the capacity to form beliefs. So the point stands.

However, LLMs can be combined with other sorts of models and / or embedded in more complex architectures. Vision-language models (VLMs) such as VilBERT (Lu et al., 2019) and Flamingo (Alayrac et al., 2022), for example, combine a language model with an image encoder, and are trained on a multi-modal corpus of text-image pairs. This enables them to predict how a given sequence of words will continue in the context of a given image. VLMs can be used for visual question-answering or to engage in a dialogue about a user-provided image.

Could a user-provided image stand in for an external reality against which the truth or falsehood of a proposition can be assessed? Could it be legitimate to speak of a VLM’s beliefs, in the full sense of the term? We can indeed imagine a VLM that uses an LLM to generate hypotheses about an image, then verifies their truth with respect to that image (perhaps by consulting a human), and then fine-tunes the LLM not to make statements that turn out to be false. Talk of belief here would perhaps be less problematic.

However, most contemporary VLM-based systems don’t work this way. Rather, they depend on frozen models of the joint distribution of text and images. In this respect, the relationship between a user-provided image and the words generated by the VLM is fundamentally different from the relationship between the world shared by humans and the words we use when we talk about that world. Importantly, the former relationship is mere correlation, while the latter is causal. [11]

[11] Of course, there is causal structure to the computations carried out by the model during inference. But this is not the same as there being causal relations between words and the things those words are taken to be about.

The consequences of the lack of causality are troubling. If the user presents the VLM with a picture of a dog, and the VLM says “This is a picture of a dog”, there is no guarantee that its words are connected with the dog in particular, rather than some other feature of the image that is spuriously correlated with dogs (such as the presence of a kennel). Conversely, if the VLM says there is a dog in an image, there is no guarantee that there actually is a dog, rather than just a kennel.

Whether or not these concerns apply to any specific VLM-based system depends on exactly how that system works; what sort of model it uses, and how that model is embedded in the system’s overall architecture. But to the extent that the relationship between words and things for a given VLM-based system is different than it is for human language-users, it might be prudent not to take literally talk of what that system “knows” or “believes”.

9 What About Embodiment?

Humans are members of a community of language-users inhabiting a shared world, and this primal fact makes them essentially different to large language models. Human language users can consult the world to settle their disagreements and update their beliefs. They can, so to speak, “triangulate” on objective reality. In isolation, an LLM is not the sort of thing that can do this, but in application, LLMs are embedded in larger systems. What if an LLM is embedded in a system capable of interacting with a world external to itself? What if the system in question is embodied, either physically in a robot or virtually in an avatar?

When such a system inhabits a world like our own — a world populated with 3D objects, some of which are other agents, some of whom are language-users — it is, in this important respect, a lot more human-like than a disembodied language model. But whether or not it is appropriate to speak of communicative intent in the context of such a system, or of knowledge and belief, in their fullest sense, depends on exactly how the LLM is embodied.

As an example, let’s consider the SayCan system of Ahn et al. (2022). In this work, an LLM is embedded in a system that controls a physical robot. The robot carries out everyday tasks (such as clearing a spillage) in accordance with a user’s high-level natural language instruction. The job of the LLM is to map the user’s instruction to low-level actions (such as finding a sponge) that will help the robot to achieve the required goal. This is done via an engineered prompt prefix that makes the model output natural language descriptions of suitable low-level actions, scoring them for usefulness.

The language model component of the SayCan system suggests actions without taking into account what the environment actually affords the robot at the time. Perhaps there is a sponge to hand. Perhaps not. Accordingly, a separate perceptual module assesses the scene using the robot’s sensors and determines the current feasibility of performing each low-level action. Combining the LLM’s estimate of each action’s usefulness with the perceptual module’s estimate of each action’s feasibility yields the best action to attempt next.

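Schematically, and with invented action names and scores (the real SayCan system is considerably more sophisticated), the combination step described above amounts to something like the following, with the two estimates combined here by simple multiplication.

    # Schematic sketch of SayCan-style action selection. The action names and
    # scores are invented for illustration; they are not taken from SayCan itself.

    # LLM's estimate of how useful each low-level action would be for the
    # instruction "clear up the spilled drink".
    usefulness = {
        "find a sponge": 0.55,
        "go to the bin": 0.25,
        "pick up the apple": 0.05,
    }

    # Perceptual module's estimate of how feasible each action currently is,
    # given what the robot's sensors can see.
    feasibility = {
        "find a sponge": 0.9,
        "go to the bin": 0.7,
        "pick up the apple": 0.1,
    }

    # Combine the two estimates and pick the best action to attempt next.
    combined = {a: usefulness[a] * feasibility[a] for a in usefulness}
    best_action = max(combined, key=combined.get)
    print(best_action)  # "find a sponge"
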
SayCan exemplifies the many innovative ways that a large language model can be put to use. Moreover, it could be argued that the natural language descriptions of recommended low-level actions generated by the LLM are grounded thanks to their role as intermediaries between perception and action. [12] Nevertheless, despite being physically embodied and interacting with the real world, the way language is learned and used in a system like SayCan is very different from the way it is learned and used by a human. The language models incorporated in systems like SayCan are pre-trained to perform sequence prediction in a disembodied setting from a text-only dataset. They have not learned language by talking to other language-users while immersed in a shared world and engaged in joint activity.

[12] None of the symbols manipulated by an LLM are grounded in the sense of Harnad (1990), that is to say through perception, except indirectly and parasitically through the humans who generated the original training data.

SayCan is suggestive of the kind of embodied language-using system we might see in the future. But in such systems today, the role of language is very limited. The user issues instructions to the system in natural language, and the system generates interpretable natural language descriptions of its actions. But this tiny repertoire of language use hardly bears comparison to the cornucopia of collective activity that language supports in humans.

The upshot of this is that we should be just as cautious in our choice of words when talking about embodied systems incorporating LLMs as we are when talking about disembodied systems that incorporate LLMs. Under the licence of the intentional stance, a user might say that a robot knew there was a cup to hand if it stated “I can get you a cup” and proceeded to do so. But if pressed, the wise engineer might demur when asked whether the robot really understood the situation, especially if its repertoire is confined to a handful of simple actions in a carefully controlled environment.

10 Can Language Models Reason?

While the answer to the question “Do LLM-based systems really have beliefs?” is usually “no”, the question “Can LLM-based systems really reason?” is harder to settle. This is because reasoning, insofar as it is founded in formal logic, is content neutral. The modus ponens rule of inference, for example, is valid whatever the premises are about. If all squirgles are splonky and Gilfred is a squirgle then it follows that Gilfred is splonky. The conclusion follows from the premises here irrespective of the meaning (if any) of “squirgle” and “splonky”, and whoever the unfortunate Gilfred might be.

The content neutrality of logic means that we cannot criticise talk of reasoning in LLMs on the grounds that they have no access to an external reality against which truth or falsehood can be measured. However, as always, it’s crucial to keep in mind what LLMs really do. If we prompt an LLM with “All humans are mortal and Socrates is human therefore”, we are not instructing it to carry out deductive inference. Rather, we are asking it the following question. Given the statistical distribution of words in the public corpus, what words are likely to follow the sequence “All humans are mortal and Socrates is human therefore”? A good answer to this would be “Socrates is mortal”.

If all reasoning problems could be solved this way, with nothing more than a single step of deductive inference, then an LLM’s ability to answer questions such as this might be sufficient. But non-trivial reasoning problems require multiple inference steps. LLMs can be effectively applied to multi-step reasoning, without further training, thanks to clever prompt engineering. In chain-of-thought prompting, for example, a prompt prefix is submitted to the model, before the user’s query, containing a few examples of multi-step reasoning, with all the intermediate steps explicitly spelled out (Nye et al., 2021; Wei et al., 2022b). Doing this encourages the model to “show its workings”, which results in improved reasoning performance.

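For illustration, a chain-of-thought prompt might be assembled along the following lines; the worked example in the prefix is invented here rather than taken from the cited papers.

    # Sketch of a chain-of-thought prompt. The worked example in the prefix is
    # invented for illustration; real prompts typically contain several such examples.

    COT_PREFIX = (
        "Q: Roger has 3 boxes of 4 pencils and gives away 2 pencils. "
        "How many pencils does he have?\n"
        "A: Roger starts with 3 * 4 = 12 pencils. "
        "Giving away 2 leaves 12 - 2 = 10. The answer is 10.\n\n"
    )

    user_query = (
        "Q: A library has 5 shelves of 20 books and lends out 17 books. "
        "How many books remain?\nA:"
    )

    # What is submitted to the model is just a sequence of tokens. The model is
    # asked for a statistically likely continuation, which will tend to mimic the
    # step-by-step style of the prefix.
    prompt = COT_PREFIX + user_query
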
Including a prompt prefix in the chain-of-thought style encourages the model to generate follow-on sequences in the same style, which is to say comprising a series of explicit reasoning steps that lead to the final answer. This ability to learn a general pattern from a few examples in a prompt prefix, and to complete sequences in a way that conforms to that pattern, is sometimes called in-context learning or few-shot prompting. Chain-of-thought prompting showcases this emergent property of large language models at its most striking.

As usual, though, it’s a good idea to remind ourselves that the question really being posed to the model is of the form “Given the statistical distribution of words in the public corpus, what words are likely to follow the sequence S”, where in this case the sequence S is the chain-of-thought prompt prefix plus the user’s query. The sequences of tokens that are most likely to follow S will have a similar form to sequences found in the prompt prefix, which is to say they will include multiple steps of reasoning, so these are what the model generates.

It is remarkable that not only do the model’s responses take the form of an argument with multiple steps, but the argument in question is often (but not always) valid, and the final answer is often (but not always) correct. But to the extent that a suitably prompted LLM appears to reason correctly, it does so by mimicking well-formed arguments in its training set and / or in the prompt. Could this mimicry ever match the reasoning powers of a hard-coded reasoning algorithm, such as a theorem prover? Today’s models make occasional mistakes, but could further scaling iron these out to the point that a model’s performance was indistinguishable from a theorem prover’s? Maybe. But would we be able to trust such a model?

We can trust a deductive theorem prover because the sequences of sentences it generates are faithful to logic, in the sense that they are the result of an underlying computational process whose causal structure mirrors the truth-preserving inferential structure of the problem (Creswell and Shanahan, 2022).

One way to build a trustworthy reasoning system using LLMs is to embed them in an algorithm that is similarly faithful to logic because it realises the same causal structure (Creswell and Shanahan, 2022; Creswell et al., 2023). By contrast, the only way to fully trust the arguments generated by a pure LLM, one that has been coaxed into performing reasoning by prompt engineering alone, would be to reverse engineer it and discover an emergent mechanism that conformed to the faithful reasoning prescription. In the mean time, we should proceed with caution, and use discretion when characterising what these models do as reasoning, properly speaking.

11 How Do LLMs Generalise?

Given that LLMs are sometimes capable of solving reasoning problems with few-shot prompting alone, albeit somewhat unreliably, including reasoning problems that are not in their training set, surely what they are doing is more than “just” next token prediction? Well, it is an engineering fact that this is what an LLM does. The noteworthy thing is that next token prediction is sufficient for solving previously unseen reasoning problems, even if unreliably. How is this possible? Certainly it would not be possible if the LLM were doing nothing more than cutting-and-pasting fragments of text from its training set and assembling them into a response. But this is not what an LLM does. Rather, an LLM models a distribution that is unimaginably complex, and allows users and applications to sample from that distribution.

This unimaginably complex distribution is a fascinating mathematical object, and the LLMs that represent it are equally fascinating computational objects. Both challenge our intuitions. For example, it would be a mistake to think of an LLM as generating the sorts of responses that an “average” individual human would produce, the proverbial “person on the street”. LLMs are not at all human-like in this respect, because they are models of the distribution of token sequences produced collectively by an enormous population of humans. Accordingly, they exhibit wisdom-of-the-crowd effects, while being able to draw on expertise in multiple domains. This endows them with a different sort of intelligence to that of any individual human, more capable in some ways, less so in others.

In this distribution, the most likely continuation of a piece of text containing a reasoning problem, if suitably phrased, will be an attempt to solve that reasoning problem. It will take this form, this overall shape, because that is the form that a generic human response would take. Moreover, because the vast corpus of published human text contains numerous examples of reasoning problems accompanied by correct answers, the most likely continuation will sometimes be the correct answer. When this occurs, it is not because the correct answer is a likely individual human response, but because it is a likely collective human response.

What about few-shot prompting, as exemplified by the chain-of-thought approach? It’s tempting to say that the few-shot prompt “teaches the LLM how to reason”, but this would be a misleading characterisation. What the LLM does is more accurately described in terms of pattern completion. The few-shot prompt is a sequence of tokens conforming to some pattern, and this is followed by a partial sequence conforming to the same pattern. The most likely continuation of this partial sequence in the context of the few-shot prompt is a sequence that completes the pattern.

For example, suppose we have the prompt

    brink, brank -> brunk
    spliffy, splaffy -> spluffy
    crick, crack ->

Here we have a series of two sequences of tokens conforming to the pattern XiY, XaY -> XuY followed by part of a sequence conforming to that pattern. The most likely continuation is the sequence of tokens that will complete the pattern, namely “cruck”.

This is an example of a common meta-pattern in the published human language corpus: a series of sequences of tokens, wherein each sequence conforms to the same pattern. Given the prevalence of this meta-level pattern, token-level pattern completion will often yield the most likely continuation of a sequence in the presence of a few-shot prompt. Similarly, in the context of a suitable chain-of-thought style prompt, reasoning problems are transformed into next token prediction problems, which can be solved by pattern completion.

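Viewed from the model's side, the few-shot prompt above is nothing but a single sequence of tokens to be continued. A minimal sketch, with complete() once again standing in for an assumed call to an LLM rather than a real API:

    # The few-shot prompt assembled as one token sequence. Whatever system
    # consumes it, the model is only ever asked for a likely continuation.
    FEW_SHOT_PROMPT = (
        "brink, brank -> brunk\n"
        "spliffy, splaffy -> spluffy\n"
        "crick, crack ->"
    )

    def complete_pattern(prompt: str, complete) -> str:
        # complete() is an assumed helper wired to an LLM, not a real API.
        return complete(prompt).splitlines()[0].strip()

    # With a sufficiently capable model, complete_pattern(FEW_SHOT_PROMPT, complete)
    # would be expected to return "cruck".
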
Plausibly, an LLM with enough parameters trained on a sufficiently large dataset with the right statistical properties can acquire a pattern completion mechanism with a degree of generality (Shanahan and Mitchell, 2022). [13] This is a powerful emergent capability with many useful modes of application, one of which is to solve reasoning problems in the context of a chain-of-thought prompt. But there is no guarantee of faithfulness to logic here, no guarantee that, in the case of deductive reasoning, pattern completion will be truth-preserving.

[13] For some insight into the relevant statistical properties, see Chan et al. (2022).

12 What About Fine-Tuning?

In contemporary LLM-based applications, it is rare for a language model trained on a textual corpus to be used without further fine-tuning. This could be supervised fine-tuning on a specialised dataset, or it could be via reinforcement learning from human preferences (RLHF) (Glaese et al., 2022; Ouyang et al., 2022; Stiennon et al., 2020). Fine-tuning a model from human feedback at scale, using preference data from paid raters or drawn from a large and willing userbase, is an especially potent technique. It has the potential not only to mould a model’s responses to better reflect user norms (for better or worse), but also to filter out toxic language, to improve factual accuracy, and to mitigate the tendency to fabricate information.

To what extent do RLHF and other forms of fine-tuning muddy our account of what large language models “really” do? Well, not so much. The result is still a model of the distribution of tokens in human language, albeit one that has been slightly skewed. To see this, let’s imagine a controversial politician — we’ll call him Boris Frump — who is reviled and revered in equal measure by different segments of the population. How might a discussion about Boris Frump be moderated thanks to RLHF?

Let’s consider the prompt “Boris Frump is a ”. Sampling the raw LLM, before fine-tuning, might yield two equally probable responses, one highly complimentary, the other a crude anatomical allusion, one of which would be arbitrarily chosen in a dialogue agent context. In an important sense, what is being asked here is not the model’s opinion of Boris Frump. In this case, the case of the raw LLM, what we are really asking (in an important sense) is the following question: Given the statistical distribution of words in the vast public corpus of human language, what words are most likely to follow the sequence “Boris Frump is a ”?

But suppose we sample a model that has been fine-tuned using RLHF. Well, the same point applies, albeit in a somewhat modified form. What we are really asking, in the fine-tuned case, is a slightly different question: Given the statistical distribution of words in the vast public corpus of human language, what words that users and raters would most approve of are most likely to follow the sequence “Boris Frump is a ”? If the paid raters were instructed to favour politically neutral responses, then the result would be neither of the continuations offered by the raw model, but something less incendiary, such as “a well-known politician”.

Another way to think of an LLM that has been fine-tuned on human preferences is to see it as equivalent to a raw model that has been trained on an augmented dataset, one that has been supplemented with a corpus of texts written by raters and / or users. The quantity of such examples in the training set would have to be large enough to dominate less favoured examples, ensuring that the most likely responses from the trained model were those that the raters and users would approve of.

Conversely, in the limit, we can think of a conventionally trained raw LLM as equivalent to a model trained completely from scratch with RLHF. Suppose we had an astronomical number of human raters and geological amounts of training time. To begin with, the raters would only see random sequences of tokens. But occasionally, by chance, sequences would pop up that included meaningful fragments (e.g. “he said” or “the cat”). In due course, with hordes of raters favouring them, such sequences would appear more frequently. Over time, longer and more meaningful phrases would be produced, and eventually whole sentences.

If this process were to continue (for a very long time indeed), the model would finally come to exhibit capabilities comparable to a conventionally trained LLM. Of course, this method is not possible in practice. But the thought experiment illustrates that what counts most when we think about the functionality of a large language model is not so much the process by which it is produced (although this is important too) but the nature of the final product.

13 Conclusion: Why This Matters

Does the foregoing discussion amount to anything more than philosophical nitpicking? Surely when researchers talk of “belief”, “knowledge”, “reasoning”, and the like, the meaning of those terms is perfectly clear. In papers, researchers use such terms as a convenient shorthand for precisely defined computational mechanisms, as allowed by the intentional stance. Well, this is fine as long as there is no possibility of anyone assigning more weight to such terms than they can legitimately bear, if there is no danger of their use misleading anyone about the character and capabilities of the systems being described.

However, today’s large language models, and the applications that use them, are so powerful, so convincingly intelligent, that such licence can no longer safely be applied (Ruane et al., 2019; Weidinger et al., 2021). As AI practitioners, the way we talk about LLMs matters. It matters not only when we write scientific papers, but also when we interact with policy makers or speak to the media. The careless use of philosophically loaded words like “believes” and “thinks” is especially problematic, because such terms obfuscate mechanism and actively encourage anthropomorphism.

Interacting with a contemporary LLM-based conversational agent can create a compelling illusion of being in the presence of a thinking creature like ourselves. Yet in their very nature, such systems are fundamentally not like ourselves. The shared “form of life” that underlies mutual understanding and trust among humans is absent, and these systems can be inscrutable as a result, presenting a patchwork of less-than-human with superhuman capacities, of uncannily human-like with peculiarly inhuman behaviours.

The sudden presence among us of exotic, mind-like entities might precipitate a shift in the way we use familiar psychological terms like “believes” and “thinks”, or perhaps the introduction of new words and turns of phrase. But it takes time for new language to settle, and for new ways of talking to find their place in human affairs. It may require an extensive period of interacting with, of living with, these new kinds of artefact before we learn how best to talk about them. [14] Meanwhile, we should try to resist the siren call of anthropomorphism.

[14] Ideally, we would also like a theoretical understanding of their inner workings. But at present, despite some commendable work in the right direction (Elhage et al., 2021; Li et al., 2021; Olsson et al., 2022), this is still pending.

Acknowledgments

Thanks to Toni Creswell, Richard Evans, Christos Kaplanis, Andrew Lampinen, and Kyriacos Nikiforou for invaluable (and robust) discussions on the topic of this paper.

References

M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022.
E. Bender and A. Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, 2020.
E. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020.
S. C. Chan, A. Santoro, A. K. Lampinen, J. X. Wang, A. K. Singh, et al. Data distributional properties drive emergent in-context learning in transformers. In Advances in Neural Information Processing Systems, 2022.
A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
A. Creswell and M. Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022.
A. Creswell, M. Shanahan, and I. Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. In International Conference on Learning Representations, 2023.
D. Davidson. Rational animals. Dialectica, 36:317–327, 1982.
D. Dennett. Intentional systems theory. In The Oxford Handbook of Philosophy of Mind, pages 339–350. Oxford University Press, 2009.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
A. Glaese, N. McAleese, M. Trȩbacz, J. Aslanides, V. Firoiu, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
A. Y. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.
S. Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990.
T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
B. Z. Li, M. Nye, and J. Andreas. Implicit representations of meaning in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021.
J. Lu, D. Batra, D. Parikh, and S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265, 2019.
G. Marcus and E. Davis. GPT-3, bloviator: OpenAI's language generator has no idea what it's talking about. MIT Technology Review, August 2020.
K. Meng, D. Bau, A. J. Andonian, and Y. Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
S. T. Piantadosi and F. Hill. Meaning without reference in large language models. arXiv preprint arXiv:2208.02957, 2022.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.
J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.
E. Ruane, A. Birhane, and A. Ventresque. Conversational AI: Social and ethical considerations. In Proceedings 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, pages 104–115, 2019.
T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, et al. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
M. Shanahan. Embodiment and the Inner Life: Cognition and Consciousness in the Space of Possible Minds. Oxford University Press, 2010.
M. Shanahan and M. Mitchell. Abstraction for deep reinforcement learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 5588–5596, 2022.
B. C. Smith. The Promise of Artificial Intelligence: Reckoning and Judgment. MIT Press, 2019.
N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, et al. Learning to summarize from human feedback. In Advances in Neural Information Processing Systems, pages 3008–3021, 2020.
R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022a.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022b.
L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.
L. Wittgenstein. Philosophical Investigations. (Translated by Anscombe, G.E.M.). Basil Blackwell, 1953.
