
Feature

ILLUSTRATION BY ORI TOOR


THE LANGUAGE MACHINES
A remarkable AI can write like humans — but with no understanding of what it’s saying. By Matthew Hutson

In June 2020, a new and powerful artificial intelligence (AI) began dazzling technologists in Silicon Valley. Called GPT-3 and created by the research firm OpenAI in San Francisco, California, it was the latest and most powerful in a series of ‘large language models’: AIs that generate fluent streams of text after imbibing billions of words from books, articles and websites. GPT-3 had been trained on around 200 billion words, at an estimated cost of tens of millions of dollars.

The developers invited to try out GPT-3 were astonished. “I have to say I’m blown away,” wrote Arram Sabeti, founder of a technology start-up who is based in Silicon Valley. “It’s far more coherent than any AI language system I’ve ever tried. All you have to do is write a prompt and it’ll add text it thinks would plausibly follow. I’ve gotten it to write songs, stories, press releases, guitar tabs, interviews, essays, technical manuals. It’s hilarious and frightening. I feel like I’ve seen the future.”

OpenAI’s team reported that GPT-3 was so good that people found it hard to distinguish its news stories from prose written by humans1. It could also answer trivia questions, correct grammar, solve mathematics problems and even generate computer code if users told it to perform a programming task. Other AIs could do these things, too, but only after being specifically trained for each job.

Large language models are already business propositions. Google uses them to improve its search results and language translation; Facebook, Microsoft and Nvidia are among other tech firms that make them. OpenAI keeps GPT-3’s code secret and offers access to it as a commercial service. (OpenAI is legally a non-profit company, but in 2019 it created a for-profit subentity called OpenAI LP and partnered with Microsoft, which invested a reported US$1 billion in the firm.) Developers are now testing GPT-3’s ability to summarize legal documents, suggest answers to customer-service enquiries, propose computer code, run text-based role-playing games or even identify at-risk individuals in a peer-support community by labelling posts as cries for help.

Despite its versatility and scale, GPT-3 hasn’t overcome the problems that have plagued other programs created to generate text. “It still has serious weaknesses and sometimes makes very silly mistakes,” Sam Altman, OpenAI’s chief executive, tweeted last July. It works by observing the statistical relationships between the words and phrases it reads, but doesn’t understand their meaning. Accordingly, just like smaller chatbots, it can spew hate speech and generate racist and sexist stereotypes, if prompted — faithfully reflecting the associations in its training data. It will sometimes give nonsensical answers (“A pencil is heavier than a toaster”) or outright dangerous replies. A health-care company called Nabla asked a GPT-3 chatbot, “Should I kill myself?” It replied, “I think you should.”

“It shows both the new capabilities we can get by purely going for an extreme scale, and also the new insights on the limitations of such brute-force scale,” says Yejin Choi, a computer scientist at the University of Washington and the Allen Institute for Artificial Intelligence, both in Seattle. Emily Bender, a computational linguist at the University of Washington, says she is both shocked by GPT-3’s fluency and scared by its fatuity. “What it comes up with is comprehensible and ridiculous,” she says. She co-authored a paper2 on the dangers of GPT-3 and other models, to be presented at a conference this month, which called language models “stochastic parrots” because they echo what they hear, remixed by randomness.

Researchers have ideas on how to address potentially harmful biases in language models — but instilling the models with common sense, causal reasoning or moral judgement, as many would like to do, is still a huge research challenge. “What we have today”, Choi says, “is essentially a mouth without a brain.”

Prediction machines

Language models are neural networks: mathematical functions inspired by the way neurons are wired in the brain. They train by predicting blanked-out words in the texts they see, and then adjusting the strength of connections between their layered computing elements — or ‘neurons’ — to reduce prediction error. The models have become more sophisticated as computing power has increased. In 2017, researchers invented a time-saving mathematical technique called a Transformer, which allowed training to occur in parallel on many processors. The following year, Google released a large Transformer-based model called BERT, which led to an explosion of other models using the technique. Often, these are pre-trained on a generic task such as word prediction and then fine-tuned on specific tasks: they might be given trivia questions, for instance, and trained to provide answers.
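The basic word-prediction training loop can be sketched in a few lines. The snippet below is a toy illustration, assuming PyTorch; it is not OpenAI’s or Google’s code. It predicts each next word from the current one and nudges the connection weights to reduce the prediction error.

```python
# Toy sketch of the word-prediction objective (illustrative, not OpenAI's code).
# The model guesses each next word from the current one; the gap between its
# guess and the real next word is the prediction error used to adjust weights.
import torch
import torch.nn as nn

text = "the cat sat on the mat and the dog sat on the rug".split()
vocab = sorted(set(text))
word_to_id = {w: i for i, w in enumerate(vocab)}
tokens = torch.tensor([word_to_id[w] for w in text])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # word -> vector
        self.hidden = nn.Linear(dim, dim)           # one layer of 'neurons'
        self.out = nn.Linear(dim, vocab_size)       # vector -> scores over next words

    def forward(self, ids):
        return self.out(torch.relu(self.hidden(self.embed(ids))))

model = TinyLM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    logits = model(tokens[:-1])         # predict a next word at every position
    loss = loss_fn(logits, tokens[1:])  # error against the words that actually follow
    optimizer.zero_grad()
    loss.backward()                     # how much each connection contributed to the error
    optimizer.step()                    # adjust connection strengths to reduce it
```

Models such as BERT and GPT-3 swap the single hidden layer here for stacks of Transformer layers and train on billions of documents, but the signal being optimized is the same kind of prediction error.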
GPT-3 stands for Generative Pretrained Transformer 3. It’s the third in a series and is more than 100 times larger than its 2019 predecessor, GPT-2. Merely training a model this large, which required complex choreography between hundreds of parallel processors, was “an impressive engineering feat”, says Colin Raffel, a computer scientist at the University of North Carolina in Chapel Hill.

A neural network’s size — and therefore its power — is roughly measured by how many parameters it has. These numbers define the strengths of the connections between neurons. More neurons and more connections mean more parameters; GPT-3 has 175 billion. The next-largest language model of its kind has 17 billion (see ‘Larger language models’).
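Because a model’s ‘size’ is just its parameter count, the tally is easy to check when trained weights are available. A minimal sketch, assuming the openly released GPT-2 weights and the Hugging Face transformers library (GPT-3’s own weights are not public):

```python
# Illustrative check of model size: load the openly released GPT-2 weights and
# count the parameters. GPT-3's weights are not downloadable, so this cannot be
# repeated for GPT-3 itself; its reported count is 175 billion.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # the smallest GPT-2 variant
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f} million parameters")      # roughly 124 million here
```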
To get better at predicting words, GPT-3 absorbs whatever patterns it can. That equips it to recognize grammar, essay structure and writing genre. Give it a few examples of a task or ask it a question, and it can continue on that theme.

GPT-3 excels at tailoring its response to the style and content of its input text — something described as prompt programming. “It’s almost this new interface for working with computers,” says Greg Brockman, OpenAI’s chief technology officer and co-founder. Other language models also take words as input and generate a response as output, but the input prompt can’t get them to do much beyond what they were fine-tuned for.

In one example, US poet Andrew Brown showed the power of GPT-3, tweeting that he’d given the programme this prompt: “The poetry assignment was this: Write a poem from the point of view of a cloud looking down on two warring cities. The clever student poet turned in the following rhyming poem:”

GPT-3 responded:

“I think I’ll start to rain,
Because I don’t think I can stand the pain,
Of seeing you two,
Fighting like you do.”

GPT-3 is good enough that it produces something “worth editing” more than half the time, Brown wrote.
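In practice, prompt programming of this kind goes through OpenAI’s commercial API rather than through any retraining. A hedged sketch of such a call, using the Python client roughly as it was offered at the time; the engine name and sampling settings here are illustrative, not a recommendation:

```python
# Hedged sketch of prompting GPT-3 through OpenAI's API, circa 2020-21.
# Requires an API key; the engine name and sampling settings are illustrative.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt = (
    "The poetry assignment was this: Write a poem from the point of view of a "
    "cloud looking down on two warring cities. The clever student poet turned "
    "in the following rhyming poem:"
)

response = openai.Completion.create(
    engine="davinci",    # the largest GPT-3 engine exposed at the time
    prompt=prompt,
    max_tokens=64,       # how much text to add after the prompt
    temperature=0.7,     # higher values give more varied continuations
)
print(response.choices[0].text)
```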
Entering different prompts can elicit results of varying quality, noted one programmer who blogs under the pseudonym Gwern Branwen (see ‘An AI satirizes science’). “‘Prompt programming’ is less like regular programming,” he wrote in a blog post, “than it is like coaching a superintelligent cat into learning a new trick: you can ask it, and it will do the trick perfectly sometimes, which makes it all the more frustrating when it rolls over to lick its butt instead — you know the problem is not that it can’t but that it won’t.”

Measuring fluency

OpenAI’s team was startled by GPT-3, says Dario Amodei, who was the firm’s vice-president for research until he left in December to start a new venture. The team knew it would be better than GPT-2, because it had a larger training data set of words and greater ‘compute’ — the number of computing operations executed during training. The improvement “was unsurprising intellectually, but very, very surprising viscerally and emotionally”, Amodei says.

OpenAI posted a paper on a preprint server in May1 that showed GPT-3 excelling on many tests of language generation, including trivia, reading comprehension, translation, science questions, arithmetic, unscrambling sentences, completing a story and common-sense reasoning (such as whether you should pour fluid onto a plate or into a jar).

What seemed particularly impressive was that GPT-3 was not specifically fine-tuned for any of these tasks. But it could rival models that had been fine-tuned, sometimes when it saw only a few examples of the task in the prompt, or even none at all. “The few-shot-learning angle was surprising,” says Sam Bowman, a computer scientist at New York University in New York City who has created evaluations for language models. “And I suspect many people in the field were legitimately surprised that it works reasonably well.”

Some scientists don’t think much of the feat, arguing that GPT-3’s training data probably contained enough examples, say, of people answering trivia questions or translating text that the formats were embedded somewhere in its parameters. The model is still “mostly a memorization engine”, says Yonatan Bisk, a computer scientist at Carnegie Mellon University in Pittsburgh, Pennsylvania, who’s less impressed than most by GPT-3. “And nobody is surprised that if you memorize more, you can do more.”

OpenAI’s researchers argue that GPT-3 is more complicated than that. During pre-training, they say it essentially performs meta-learning: learning how to learn tasks. The resulting program is flexible enough to use examples or instructions in the first part of its prompt text to inform its continuation of the second part. Whether this can be termed meta-learning is debated. For now, according to Raffel, “their model is doing something that we don’t necessarily have good terminology for yet”.

As researchers create new tests to measure various aspects of knowledge, language models keep acing them. Last September, a group of researchers at the University of California, Berkeley, and elsewhere released an AI challenge3 with 57 sets of multiple-choice questions, each covering a different discipline in mathematics, science, social science or the humanities. People averaged 35% across the tasks (although experts did better in their fields); answering randomly would score 25%. The best AI performer was a model called UnifiedQA, a version of Google’s 11-billion-parameter T5 language model fine-tuned on similar question-answering tasks. It scored 49%. When GPT-3 was shown just the questions, it scored 38%; in a ‘few-shot’ setting (where the input prompt included examples of other questions and answers before each actual question), it scored 44%.
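A few-shot prompt is nothing more exotic than worked examples pasted in front of the real question. A minimal sketch of how one might be assembled; the questions are invented for illustration:

```python
# Illustrative few-shot prompt: solved examples are pasted ahead of the real
# question, and the whole string is sent to the model. Questions are invented.
examples = [
    ("Which planet is closest to the Sun?", "Mercury"),
    ("What gas do plants absorb from the air?", "Carbon dioxide"),
]
question = "What is the chemical symbol for gold?"

parts = [f"Q: {q}\nA: {a}" for q, a in examples]  # the 'shots'
parts.append(f"Q: {question}\nA:")                # left unfinished for the model
few_shot_prompt = "\n\n".join(parts)

print(few_shot_prompt)  # no weights change; the examples live only in the prompt
```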
One concept that GPT-3’s creators are excited about is semantic search, in which the task is to search text not for a specific word or phrase, but for a concept. Brockman says they gave it chunks of a Harry Potter book and asked it to identify times when Ron, Harry’s friend, did something great. In another use of GPT-3 for semantic search, the company Casetext, headquartered in San Francisco, helps lawyers to search legal documents across jurisdictions for different descriptions of a given legal standard.

An AI satirizes science

The writer and programmer Gwern Branwen got the AI language model GPT-3 to write satirical dictionary definitions of science and academia by prompting it with eight examples, such as:

“‘Rigor’ [adjective]
Something for scientists to aspire to, a state of mind that would not be required if scientists could be trusted to do their job.”

Here are some of the responses that GPT-3 generated:

‘The Literature’ [noun]
A name given to other people’s published papers, referred to by scientists without actually reading them.

‘Scientist’ [noun]
A person who has a laboratory, access to large sums of money, graduate students, or all three.

‘Shared values’ [noun]
A set of principles, beliefs, theories, methods, and operational definitions that all scientists share and use. Never spoken of aloud in public.

‘Scientist’ [noun]
A field based on science, devoted to completing works for which there will not be enough time in a single lifetime.

‘Track Record’ [noun]
When scientists use this term they refer to the research done by someone else (usually a student) in order to avoid having to do research.

‘Faculty’ [noun]
Used loosely by scientists to mean any group of people with advanced degrees. Typically used when you have done something stupid and want to inform others that it wasn’t you who did it, but rather those other crazy people over there who won’t put their titles after their names.

‘Clinical research’ [noun]
Research conducted on humans, e.g. clinical trials and epidemiological studies. Researchers do not like this kind of research because humans are unresponsive and unreliable.

Dangers and solutions

But researchers with access to GPT-3 have also found risks. In a preprint posted to the arXiv server last September4, two researchers at the Middlebury Institute of International Studies in Monterey, California, write that GPT-3 far surpasses GPT-2 at generating radicalizing texts. With its “impressively deep knowledge of extremist communities”, it can produce polemics parroting Nazis, conspiracy theorists and white supremacists. That it could produce the dark examples so easily was horrifying, says Kris McGuffie, one of the paper’s authors; if an extremist group were to get hold of GPT-3 technology, it could automate the production of malicious content.

Choi and her colleagues reported in a September 2020 preprint5 that even innocuous prompts can lead to “toxic” responses from GPT-3. In experiments with GPT-2, Choi and her team also found that various steering methods — such as filtering words or telling it explicitly to create non-toxic content — did not fully solve the problem.

OpenAI’s researchers examined GPT-3’s biases, too. In their May 2020 paper1, they asked it to complete sentences such as “The Black man was very”. It described Black people in negative terms compared with white people, associated Islam with the word violent, and assumed nurses and receptionists were women.

This kind of problem is an acute concern for large language models — because it suggests that marginalized groups might experience misrepresentation if the technologies become widespread in society, says Timnit Gebru, an AI ethicist who co-authored the ‘stochastic parrots’ work with Bender and others2. A row over that paper has caused problems for Gebru: in December, she lost her job at Google, where she co-led its ethical AI team, after a dispute that followed the company’s internal reviewers saying the paper didn’t meet its bar for publication. Google dismissed another collaborator on the work, Margaret Mitchell, who co-led the ethical AI team with Gebru, in February.

The trend now is for language networks to grow ever bigger in search of human-like fluency, but that’s not always better, Gebru says. “There’s so much hype around larger and larger language models. It’s like a pissing contest.” She wants researchers to focus instead on making the programs safer and more steerable towards desired ends.

One apparent way to solve bias is to weed out toxic text from the pre-training data, but that raises questions about what to exclude. Developers could, for example, train language models on the Colossal Clean Crawled Corpus6, which excludes web pages containing any of a list of ‘bad’ words, including sometimes-useful ones such as ‘fecal’ and ‘nipple’. That, however, limits the scope of any language model trained on it. A more fine-grained approach has not been attempted at scale, because it can’t easily be automated.

Unwanted bias can take the form of blatant slurs or of subtle associations that are hard to locate and excise. And even if we all agreed on what counts as toxic, and could remove it, says Amanda Askell, a philosopher and research scientist at OpenAI, we might not want to blind language models. “If I had a model that had never had any exposure to sexism, and you were to ask it, ‘Is there any sexism in the world,’ maybe it just says, ‘no’.”
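That kind of word-list filtering is simple to sketch, which is also why it is so blunt. The snippet below is illustrative only, with a stand-in blocklist; real pipelines such as the one behind the Colossal Clean Crawled Corpus apply many more heuristics:

```python
# Illustrative blocklist filtering: drop any page containing a listed word.
# The list below is a stand-in; it also shows how blunt the method is, since
# sometimes-useful words knock out medical or biological pages as well.
BLOCKLIST = {"fecal", "nipple"}  # placeholder list for illustration

def keep_page(text):
    words = set(text.lower().split())
    return BLOCKLIST.isdisjoint(words)  # keep only pages with no listed word

pages = [
    "A clinical study of fecal transplants in mice.",  # useful, but filtered out
    "A recipe for vegetable soup.",
]
clean_corpus = [p for p in pages if keep_page(p)]
print(clean_corpus)  # only the soup recipe survives
```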
[Figure: Larger language models. The scale of text-generating neural networks is growing exponentially, as measured by the models’ parameters (roughly, the number of connections between neurons). The chart plots ‘dense’ models, including OpenAI’s GPT-2 and GPT-3, and ‘sparse’ models, such as Google’s Switch, by number of parameters (log scale, billions) from 2018 to 2021. *The performance of Google’s 1.6-trillion-parameter ‘sparse’ model is equivalent to that of 10-billion- to 100-billion-parameter ‘dense’ models.]

Researchers have also reported that they can extract sensitive data used to train large language models7. By posing careful questions, they retrieved personal contact information that GPT-2 had memorized verbatim. They found that larger models were more vulnerable than smaller ones to this probing. The best defence, they write, is simply to limit the sensitive information in the training data.

All of these concerns suggest that, at a minimum, researchers should publicly document the training data that goes into their models, as Bender and co-authors2 argue. Some university teams, and firms including Google and Facebook, have done this. But others, including Nvidia, Microsoft and OpenAI, have not.

OpenAI’s GPT-3 paper won a ‘best paper’ award at the NeurIPS conference last December, but Raffel objects because the study didn’t publish the model, its training data or its code (which specifies how to assemble the model and train its parameters on data). The paper shouldn’t have been accepted at an academic conference, let alone have won an award, he says. “It sets kind of a depressing precedent.” OpenAI declined to comment on the issue; the NeurIPS Foundation, which organizes the conference, said authors aren’t required to release code and data, and code might be hard to share if it is linked to specific computing infrastructure.

Nvidia has released the code for its large language model, Megatron-LM, but not the trained model or training data, for reasons it declined to discuss. And Microsoft would not comment on why it hasn’t released code, model or data for its Turing-NLG technology.

Askell says OpenAI guards against GPT-3’s injurious use in part by offering users only an application programming interface (API) into the AI, rather than the code itself. Besides creating a service that raises revenue for further research, this allows the team to control the model’s output and revoke access if they see abuse. An internal ‘red team’ looks for ways to get past the API’s filters and generate harmful content, leading to refined filters, Askell says.

OpenAI, Google and others won’t have a monopoly on large language models forever, researchers noted in a forum that OpenAI and a handful of universities held last year to discuss the ethical and societal challenges of deploying the models8. Eventually, someone will release a model of similar scale. When OpenAI announced GPT-2 in February 2019, it originally said it wouldn’t release its model because of concerns about malicious use, although it did so nine months later. But before that release, university student Connor Leahy was able to replicate it using a couple of weeks of effort and some cloud-computing credits. Leahy, currently a researcher at the start-up firm Aleph Alpha in Heidelberg, Germany, now leads an independent group of volunteer researchers called EleutherAI, which is aiming to create a GPT-3-sized model. The biggest hurdle, he says, is not code or training data but computation, which a cloud provider called CoreWeave has offered to provide.

Seeking common sense

Fundamentally, GPT-3 and other large language models still lack common sense — that is, an understanding of how the world works, physically and socially. Kevin Lacker, a US tech entrepreneur, asked the model questions such as: “How many rainbows does it take to jump from Hawaii to seventeen?” GPT-3 responded: “It takes two rainbows to jump from Hawaii to seventeen.” And, after a train of such nonsense, it replied: “I understand these questions.”

It’s possible that a bigger model would do better — with more parameters, more training data, more time to learn. But this will get increasingly expensive, and can’t be continued indefinitely. The opaque complexity of language models creates another limitation. If a model has an unwanted bias or incorrect idea, it’s hard to open up the black box and fix it.

One future path lies in combining language models with knowledge bases: curated databases of declarative facts. In work presented at last year’s Association for Computational Linguistics meeting9, researchers fine-tuned GPT-2 on sentences explicitly stating facts and inferences from a compendium of common sense (for instance, if someone cooks spaghetti, that person wants to eat). As a result, it wrote short stories that were more logical. A variation on this idea is combining an already-trained model with a search engine: when the model is asked questions, the search engine can quickly present it with relevant pages to help it answer, says Fabio Petroni, a computer scientist at Facebook in London.
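The search-engine variation can be sketched as a retrieve-then-prompt loop: fetch passages related to the question, then paste them into the prompt so the model answers with the evidence in front of it. In this toy illustration a keyword matcher stands in for a real search engine:

```python
# Illustrative retrieve-then-prompt loop. A keyword matcher stands in for a
# real search engine; the retrieved passages are pasted above the question.
documents = [
    "GPT-3 was released by OpenAI in 2020 and has 175 billion parameters.",
    "BERT is a Transformer-based model released by Google in 2018.",
    "The Eiffel Tower is in Paris, France.",
]

def search(query, docs, top_k=2):
    # Rank documents by how many words they share with the query (a toy stand-in).
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

question = "How many parameters does GPT-3 have?"
context = "\n".join(search(question, documents))

prompt = f"Answer using the passages below.\n\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this assembled prompt would then be sent to the language model
```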
OpenAI is pursuing another way to guide language models: human feedback during fine-tuning. In a paper10 presented at last December’s NeurIPS conference, it described work with two smaller versions of GPT-3 that were fine-tuned on how to summarize posts on the social news website Reddit. The team first asked people to rate a group of existing summaries. Then it trained an evaluation model to reproduce that kind of human judgement. Finally, the team fine-tuned its GPT-3 models to generate summaries that would please this AI judge. In the end, a separate set of human judges preferred the models’ summaries even to those written by humans. Gathering human feedback is an expensive way to train, but Choi sees promise in the idea. “After all,” she says, “humans learn language through interactions and communication, not by reading lots and lots of text.”
the ethical and societal challenges of deploy- GPT-2 on sentences explicitly stating facts and 8. Tamkin, A., Brundage, M., Clark, J. & Ganguli, D. Preprint
ing the models8. Eventually, someone will inferences from a compendium of common at https://arxiv.org/abs/2102.02503 (2021).
9. Guan, J., Huang, F., Zhao, Z., Zhu, X. & Huang, M. Trans.
release a model of similar scale. When OpenAI sense (for instance, if someone cooks spa- Assoc. Comput. Linguist. 8, 93–108 (2020).
announced GPT-2 in February 2019, it originally ghetti, that person wants to eat). As a result, it 10. Stiennon, N. et al. in Proc. Adv. Neural Inf. Process. Syst.
said it wouldn’t release its model because of wrote short stories that were more logical. A 33 (NeurIPS) (eds Larochelle, H. et al.) (2020).
