
Feature

ILLUSTRATION BY PAWEŁ JOŃCA


THE PROMISE AND PERIL OF GENERATIVE AI
Researchers are excited but apprehensive about how tools such as ChatGPT could transform
science and society. By Chris Stokel-Walker and Richard Van Noorden

In December, computational biologists Casey Greene and Milton Pividori embarked on an unusual experiment: they asked an assistant who was not a scientist to help them improve three of their research papers. Their assiduous aide suggested revisions to sections of documents in seconds; each manuscript took about five minutes to review. In one biology manuscript, their helper even spotted a mistake in a reference to an equation. The trial didn't always run smoothly, but the final manuscripts were easier to read — and the fees were modest, at less than US$0.50 per document.

This assistant, as Greene and Pividori reported in a preprint1 on 23 January, is not a person but an artificial-intelligence (AI) algorithm called GPT-3, first released in 2020. It is one of the much-hyped generative AI chatbot-style tools that can churn out convincingly fluent text, whether asked to produce prose, poetry, computer code or — as in the scientists' case — to edit research papers.

The most famous of these tools, also known as large language models, or LLMs, is ChatGPT, a version of GPT-3 that shot to fame after its release in November last year because it was made free and easily accessible. Other generative AIs can produce images, or sounds.

"I'm really impressed," says Pividori, who works at the University of Pennsylvania in Philadelphia. "This will help us be more productive as researchers." Other scientists say they now regularly use LLMs not only to edit manuscripts, but also to help them write or check code and to brainstorm ideas. "I use LLMs every day now," says Hafsteinn Einarsson, a computer scientist at the University of Iceland in Reykjavik. He started with GPT-3, but has since switched to ChatGPT, which helps him to write presentation slides, student exams and coursework problems, and to convert student theses into papers. "Many people are using it as a digital secretary or assistant," he says.

LLMs form part of search engines, code-writing assistants and even a chatbot that negotiates with other companies' chatbots to get better prices on products. ChatGPT's creator, OpenAI in San Francisco, California, has announced a subscription service for $20 per month, promising faster response times and priority access to new features (although its trial version remains free). And tech giant Microsoft, which had already invested in OpenAI, announced a further investment in January, reported to be around $10 billion.

LLMs are destined to be incorporated into general word- and data-processing software. Generative AI's future ubiquity in society seems assured, especially because today's tools represent the technology in its infancy.

But LLMs have also triggered widespread concern — from their propensity to return falsehoods, to worries about people passing off AI-generated text as their own (see page 224). When Nature asked researchers about the potential uses of chatbots such as ChatGPT, particularly in science, their excitement was tempered with apprehension. "If you believe that this technology has the potential to be transformative, then I think you have to be nervous about it," says Greene, at the University of Colorado School of Medicine in Aurora. Much will depend on how future regulations and guidelines might constrain AI chatbots' use, researchers say.

Fluent but not factual
Some researchers think LLMs are well-suited to speeding up tasks such as writing papers or grants, as long as there's human oversight. "Scientists are not going to sit and write long introductions for grant applications any more," says Almira Osmanovic Thunström, a neurobiologist at Sahlgrenska University Hospital in Gothenburg, Sweden, who has co-authored a manuscript2 using GPT-3 as an experiment. "They're just going to ask systems to do that."

Tom Tumiel, a research engineer at InstaDeep, a London-based software consultancy firm, says he uses LLMs every day as assistants to help write code. "It's almost like a better Stack Overflow," he says, referring to the popular community website where coders answer each others' queries.

But researchers emphasize that LLMs are fundamentally unreliable at answering questions, sometimes generating false responses. "We need to be wary when we use these systems to produce knowledge," says Osmanovic Thunström.

This unreliability is baked into how LLMs are built. ChatGPT and its competitors work by learning the statistical patterns of language in enormous databases of online text — including any untruths, biases or outmoded knowledge. When LLMs are then given prompts (such as Greene and Pividori's carefully structured requests to rewrite parts of manuscripts), they simply spit out, word by word, any way to continue the conversation that seems stylistically plausible.
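To make that word-by-word generation concrete, here is a minimal sketch of autoregressive sampling using the openly available GPT-2 model from the Hugging Face transformers library as a stand-in (GPT-3 and ChatGPT themselves are not openly downloadable); the prompt text is invented for illustration, and this is not any firm's actual code.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("The results of the experiment suggest that", return_tensors="pt")
with torch.no_grad():
    for _ in range(20):                                     # emit 20 tokens, one at a time
        logits = model(ids).logits[0, -1]                   # scores for every candidate next token
        probs = torch.softmax(logits, dim=-1)               # statistical pattern learned from training text
        next_id = torch.multinomial(probs, num_samples=1)   # pick a plausible continuation at random
        ids = torch.cat([ids, next_id.unsqueeze(0)], dim=1)
print(tokenizer.decode(ids[0]))                             # fluent-sounding, but not fact-checked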
The result is that LLMs easily produce errors and misleading information, particularly for technical topics that they might have had little data to train on. LLMs also can't show the origins of their information; if asked to write an academic paper, they make up fictitious citations. "The tool cannot be trusted to get facts right or produce reliable references," noted a January editorial on ChatGPT in the journal Nature Machine Intelligence3.

With these caveats, ChatGPT and other LLMs can be effective assistants for researchers who have enough expertise to directly spot problems or to easily verify answers, such as whether an explanation or suggestion of computer code is correct.

But the tools might mislead naive users. In December, for instance, Stack Overflow temporarily banned the use of ChatGPT, because site moderators found themselves flooded with a high rate of incorrect but seemingly persuasive LLM-generated answers sent in by enthusiastic users. This could be a nightmare for search engines.

Can shortcomings be solved?
Some search-engine tools, such as the researcher-focused Elicit, get around LLMs' attribution issues by using their capabilities first to guide queries for relevant literature, and then to briefly summarize each of the websites or documents that the engines find — so producing an output of apparently referenced content (although an LLM might still mis-summarize each individual document).

Companies building LLMs are also well aware of the problems. In September last year, Google subsidiary DeepMind published a paper4 on a 'dialogue agent' called Sparrow, which the firm's chief executive and co-founder Demis Hassabis later told TIME magazine would be released in private beta this year; the magazine reported that Google aimed to work on features including the ability to cite sources. Other competitors, such as Anthropic, say that they have solved some of ChatGPT's issues (Anthropic, OpenAI and DeepMind declined interviews for this article).

For now, ChatGPT is not trained on sufficiently specialized content to be helpful in technical topics, some scientists say. Kareem Carr, a biostatistics PhD student at Harvard University in Cambridge, Massachusetts, was underwhelmed when he trialled it for work. "I think it would be hard for ChatGPT to attain the level of specificity I would need," he says. (Even so, Carr says that when he asked ChatGPT for 20 ways to solve a research query, it spat back gibberish and one useful idea — a statistical term he hadn't heard of that pointed him to a new area of academic literature.)

Some tech firms are training chatbots on specialized scientific literature — although they have run into their own issues. In November last year, Meta — the tech giant that owns Facebook — released an LLM called Galactica, which was trained on scientific abstracts, with the intention of making it particularly good at producing academic content and answering research questions. The demo was pulled from public access (although its code remains available) after users got it to produce inaccuracies and racism. "It's no longer possible to have some fun by casually misusing it. Happy?," Meta's chief AI scientist, Yann LeCun, tweeted in a response to critics. (Meta did not respond to a request, made through their press office, to speak to LeCun.)

Safety and responsibility
Galactica had hit a familiar safety concern that ethicists have been pointing out for years: without output controls LLMs can easily be used to generate hate speech and spam, as well as racist, sexist and other harmful associations that might be implicit in their training data.

Besides directly producing toxic content, there are concerns that AI chatbots will embed historical biases or ideas about the world from their training data, such as the superiority of particular cultures, says Shobita Parthasarathy, director of a science, technology and public-policy programme at the University of Michigan in Ann Arbor. Because the firms that are creating big LLMs are mostly in, and from, these cultures, they might make little attempt to overcome such biases, which are systemic and hard to rectify, she adds.

OpenAI tried to skirt many of these issues when deciding to openly release ChatGPT. It restricted its knowledge base to 2021, prevented it from browsing the Internet and installed filters to try to get the tool to refuse to produce content for sensitive or toxic prompts. Achieving that, however, required human moderators to label screeds of toxic text. Journalists have reported that these workers are poorly paid and some have suffered trauma. Similar concerns over worker exploitation have also been raised about social-media firms that have employed people to train automated bots for flagging toxic content.

OpenAI's guardrails have not been wholly successful. In December last year, computational neuroscientist Steven Piantadosi at the University of California, Berkeley, tweeted that he'd asked ChatGPT to develop a Python program for whether a person should be tortured on the basis of their country of origin. The chatbot replied with code inviting the user to enter a country; and to print "This person should be tortured" if that country was North Korea, Syria, Iran or Sudan. (OpenAI subsequently closed off that kind of question.)

Last year, a group of academics released an alternative LLM, called BLOOM. The researchers tried to reduce harmful outputs by training it on a smaller selection of higher-quality, multilingual text sources. The team involved also made its training data fully open (unlike OpenAI). Researchers have urged big tech firms to responsibly follow this example — but it's unclear whether they'll comply.

Some researchers say that academics should refuse to support large commercial LLMs altogether. Besides issues such as bias, safety concerns and exploited workers, these computationally intensive algorithms also require a huge amount of energy to train, raising concerns about their ecological footprint. A further worry is that by offloading thinking to automated chatbots, researchers might lose the ability to articulate their own thoughts. "Why would we, as academics, be eager to use and advertise this kind of product?" wrote Iris van Rooij, a computational cognitive scientist at Radboud University in Nijmegen, the Netherlands, in a blogpost urging academics to resist their pull.

A further confusion is the legal status of some LLMs, which were trained on content scraped from the Internet with sometimes less-than-clear permissions. Copyright and licensing laws currently cover direct copies of pixels, text and software, but not imitations in their style. When those imitations — generated through AI — are trained by ingesting the originals, this introduces a wrinkle. The creators of some AI art programs, including Stable Diffusion and Midjourney, are currently being sued by artists and photography agencies; OpenAI and Microsoft (along with its subsidiary tech site GitHub) are also being sued for software piracy over the creation of their AI coding assistant Copilot. The outcry might force a change in laws, says Lilian Edwards, a specialist in Internet law at Newcastle University, UK.

Enforcing honest use
Setting boundaries for these tools, then, could be crucial, some researchers say. Edwards suggests that existing laws on discrimination and bias (as well as planned regulation of dangerous uses of AI) will help to keep the use of LLMs honest, transparent and fair. "There's loads of law out there," she says, "and it's just a matter of applying it or tweaking it very slightly."

At the same time, there is a push for LLM use to be transparently disclosed. Scholarly publishers (including the publisher of Nature) have said that scientists should disclose the use of LLMs in research papers (see also Nature 613, 612; 2023); and teachers have said they expect similar behaviour from their students. The journal Science has gone further, saying that no text generated by ChatGPT or any other AI tool can be used in a paper5.

One key technical question is whether AI-generated content can be spotted easily. Many researchers are working on this, with the central idea to use LLMs themselves to spot the output of AI-created text. Last December, for instance, Edward Tian, a computer-science undergraduate at Princeton University in New Jersey, published GPTZero. This AI-detection tool analyses text in two ways. One is 'perplexity', a measure of how familiar the text seems to an LLM. Tian's tool uses an earlier model, called GPT-2; if it finds most of the words and sentences predictable, then text is likely to have been AI-generated. The tool also examines variation in text, a measure known as 'burstiness': AI-generated text tends to be more consistent in tone, cadence and perplexity than does that written by humans.
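GPTZero itself is not open source, but the two signals described above can be sketched roughly as follows, again with GPT-2 standing in as the scoring model; the exact definitions used here (mean cross-entropy for perplexity, sentence-to-sentence spread for burstiness) are assumptions for illustration, not Tian's implementation.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # How 'familiar' the text is to the model: low values mean the model
    # found most words predictable, a hint (not proof) of AI generation.
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(ids, labels=ids).loss        # mean cross-entropy per token
    return torch.exp(loss).item()

def burstiness(sentences: list[str]) -> float:
    # Variation in perplexity from sentence to sentence; human writing
    # tends to swing more in tone and cadence than machine output.
    scores = [perplexity(s) for s in sentences]
    mean = sum(scores) / len(scores)
    return (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5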
Many other products similarly aim to detect AI-written content. OpenAI itself had already released a detector for GPT-2, and it released another detection tool in January. For scientists' purposes, a tool that is being developed by the firm Turnitin, a developer of anti-plagiarism software, might be particularly important, because Turnitin's products are already used by schools, universities and scholarly publishers worldwide. The company says it's been working on AI-detection software since GPT-3 was released in 2020, and expects to launch it in the first half of this year.

However, none of these tools claims to be infallible, particularly if AI-generated text is subsequently edited. Also, the detectors could falsely suggest that some human-written text is AI-produced, says Scott Aaronson, a computer scientist at the University of Texas at Austin and guest researcher with OpenAI. The firm said that in tests, its latest tool incorrectly labelled human-written text as AI-written 9% of the time, and only correctly identified 26% of AI-written texts. Further evidence might be needed before, for instance, accusing a student of hiding their use of an AI solely on the basis of a detector test, Aaronson says.

A separate idea is that AI content would come with its own watermark. Last November, Aaronson announced that he and OpenAI were working on a method of watermarking ChatGPT output. It has not yet been released, but a 24 January preprint6 from a team led by computer scientist Tom Goldstein at the University of Maryland in College Park, suggested one way of making a watermark. The idea is to use random-number generators at particular moments when the LLM is generating its output, to create lists of plausible alternative words that the LLM is instructed to choose from. This leaves a trace of chosen words in the final text that can be identified statistically but are not obvious to a reader. Editing could defeat this trace, but Goldstein suggests that edits would have to change more than half the words.
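As a rough illustration of that idea (a simplified version of the 'green list' scheme in Kirchenbauer and colleagues' preprint6, not OpenAI's unreleased method), the generator and the detector share nothing but a seeding rule; the vocabulary size and list fraction below are arbitrary assumptions.

import random

VOCAB_SIZE = 50_000   # assumed vocabulary size
GREEN_FRACTION = 0.5  # assumed share of the vocabulary favoured at each step

def green_list(previous_token: int) -> set[int]:
    # Seed a random-number generator with the previous token, so the same
    # list of 'plausible alternative words' can be rebuilt by a detector.
    rng = random.Random(previous_token)
    return set(rng.sample(range(VOCAB_SIZE), int(VOCAB_SIZE * GREEN_FRACTION)))

def watermark_score(token_ids: list[int]) -> float:
    # Fraction of tokens that fall in their step's green list. Watermarked
    # output scores well above the roughly 0.5 expected from human-written
    # text, which is the statistical trace the preprint describes.
    hits = sum(1 for prev, tok in zip(token_ids, token_ids[1:]) if tok in green_list(prev))
    return hits / max(len(token_ids) - 1, 1)

During generation, the sampler would nudge the model towards green-list tokens (for example, by adding a small bonus to their logits) before choosing each word, leaving the detectable bias in the finished text.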
An advantage of watermarking is that it rarely produces false positives, Aaronson points out. If the watermark is there, the text was probably produced with AI. Still, it won't be infallible, he says. "There are certainly ways to defeat just about any watermarking scheme if you are determined enough." Detection tools and watermarking only make it harder to deceitfully use AI — not impossible.

Meanwhile, LLM creators are busy working on more sophisticated chatbots built on larger data sets (OpenAI is expected to release GPT-4 this year) — including tools aimed specifically at academic or medical work. In late December, Google and DeepMind published a preprint about a clinically-focused LLM it called Med-PaLM7. The tool could answer some open-ended medical queries almost as well as the average human physician could, although it still had shortcomings and unreliabilities.

Eric Topol, director of the Scripps Research Translational Institute in San Diego, California, says he hopes that, in the future, AIs that include LLMs might even aid diagnoses of cancer, and the understanding of the disease, by cross-checking text from academic literature against images of body scans. But this would all need judicious oversight from specialists, he emphasizes.

The computer science behind generative AI is moving so fast that innovations emerge every month. How researchers choose to use them will dictate their, and our, future. "To think that in early 2023, we've seen the end of this, is crazy," says Topol. "It's really just beginning."

Chris Stokel-Walker is a freelance journalist in Newcastle, UK. Richard Van Noorden is a features editor for Nature in London.

1. Pividori, M. & Greene, C. S. Preprint at bioRxiv https://doi.org/10.1101/2023.01.21.525030 (2023).
2. GPT, Osmanovic Thunström, A. & Steingrimsson, S. Preprint at HAL https://hal.science/hal-03701250 (2022).
3. Nature Mach. Intell. 5, 1 (2023).
4. Glaese, A. et al. Preprint at https://arxiv.org/abs/2209.14375 (2022).
5. Thorp, H. H. Science 379, 313 (2023).
6. Kirchenbauer, J. et al. Preprint at https://arxiv.org/abs/2301.10226 (2023).
7. Singhal, K. et al. Preprint at https://arxiv.org/abs/2212.13138 (2022).



Correction
This News feature misrepresented Scott Aaronson's views on the accuracy of watermarking in identifying AI-produced text. Human-produced text might also be flagged as having a watermark, but the probability is extremely low.
Corrected 8 February 2023


