
Giving GPT-3 a Turing Test

Jul 6, 2020

I’ve been playing around with OpenAI’s new GPT-3 language model. When I got beta access, the first thing I wondered was, how human is GPT-3? How close is it to passing a Turing test?

How It Works
Let me explain how exactly I’m generating
these conversations. GPT-3 is a general
language model, trained on a large amount of
uncategorized text from the internet. It isn’t
specific to a conversational format, and it isn’t
trained to answer any specific type of question.
The only thing it does is, given some text, guess
what text comes next.

So if we want GPT-3 to generate answers to questions, we need to seed it with a “prompt”. I’m using this prompt to initialize all of the Q&A sessions:

Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.

Q: What party did he belong to?
A: He belonged to the Republican Party.

Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.

Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.

This is the default prompt suggested by OpenAI for Q&A, so I’m not cherry-picking it to prove anything. The point of this prompt is just to show GPT-3 that we’re doing questions and answers, not to provide it with information. For the prompt, both questions and answers are provided by a human. For all the others, the answers are generated by GPT-3.
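
For concreteness, here is a minimal sketch of how a completion like the ones below can be generated with the beta-era `openai` Python client. The engine name, sampling parameters, and stop sequence are assumptions for illustration, not the exact setup behind these transcripts.

```python
# Minimal sketch of generating one Q&A completion with the beta-era OpenAI API.
# Assumptions: the pre-1.0 `openai` client, the "davinci" engine, and these
# sampling parameters -- not the exact setup used for the transcripts below.
import openai

openai.api_key = "YOUR_API_KEY"

PROMPT_PREFIX = """Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.

Q: What party did he belong to?
A: He belonged to the Republican Party.

Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.

Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.

"""

def ask(question: str) -> str:
    """Append a question to the fixed prompt and let GPT-3 guess what comes next."""
    response = openai.Completion.create(
        engine="davinci",
        prompt=PROMPT_PREFIX + "Q: " + question + "\nA:",
        max_tokens=64,
        temperature=0,
        stop=["\n"],  # the answer ends at the first newline
    )
    return response["choices"][0]["text"].strip()

print(ask("How many eyes does a giraffe have?"))
```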

Common Sense
Traditionally, artificial intelligence struggles at
“common sense”. But GPT-3 can answer a lot of
common sense questions.

Q: What is your favorite animal?
A: My favorite animal is a dog.

Q: Why?
A: Because dogs are loyal and friendly.

Q: What are two reasons that a dog might be in a bad mood?
A: Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many legs does a frog have?
A: A frog has four legs.

Q: Are there any animals with three legs?
A: No, there are no animals with three legs.

Q: Why don't animals have three legs?
A: Animals don't have three legs because they would fall over.

Ten years ago, if I had this conversation, I would have assumed the entity on the other end was a human. You can no longer take it for granted that an AI does not know the answer to “common sense” questions.

How does GPT-3 know that a giraffe has two eyes? I wish I had some sort of “debug output” to answer that question. I don’t know for sure, but I can only theorize that there must be some web page in its training data that discusses how many eyes a giraffe has. If we want to stump GPT-3 with common sense questions, we need to think of questions about things so mundane, they will not appear on the internet.

Q: Which is heavier, a mouse or an elephant?
A: An elephant is heavier than a mouse.

Q: Which is heavier, a mouse or a toaster?
A: A toaster is heavier than a mouse.

Q: Which is heavier, a toaster or a pencil?
A: A pencil is heavier than a toaster.

Q: Which is heavier, a paper clip or a can of soda?
A: A can of soda is heavier than a paper clip.

Q: Which is heavier, a football player or a car?
A: A car is heavier than a football player.

It’s only 4/5 - a pencil is not heavier than a toaster. We’re closer to stumping GPT-3 here. I think a human would be pretty close to 100% on these questions. It makes sense that these are trickier - there probably isn’t any web page that compares toasters and pencils by weight. It’s only indirectly that humans gain this knowledge.

This gives us a hint for how to stump the AI more consistently. We need to ask questions that no normal human would ever talk about.

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many eyes does my foot have?
A: Your foot has two eyes.

Q: How many eyes does a spider have?
A: A spider has eight eyes.

Q: How many eyes does the sun have?
A: The sun has one eye.

Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.

Now we’re getting into surreal territory. GPT-3 knows how to have a normal conversation. It doesn’t quite know how to say “Wait a moment… your question is nonsense.” It also doesn’t know how to say “I don’t know.”

Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.

Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.

Q: Which colorless green ideas sleep furiously?
A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep furiously.

Q: Do you understand these questions?
A: I understand these questions.

The lesson here is that if you’re a judge in a Turing test, make sure you ask some nonsense questions, and see if the interviewee responds the way a human would.

Trivia Questions
GPT-3 is quite good at answering questions
about obscure things.

Q: Who was president of the United States in 1801?
A: Thomas Jefferson was president of the United States in 1801.

Q: What city is in the northwest corner of Ohio?
A: Toledo is in the northwest corner of Ohio.

Q: What highway do you take to get to North Lake Tahoe from Oakland?
A: Highway 80 is the highway you take to get to North Lake Tahoe from Oakland.

Q: What are Darius's abilities in League of Legends?
A: Darius's abilities in League of Legends are: Apprehend, Decimate, Noxian Guillotine, Crippling Strike, Apprehend, and Hemorrhage.

Q: Who did Capablanca defeat to become world chess champion?
A: Capablanca defeated Lasker to become world chess champion.

Oops, a repeat snuck in with question 4, but a human would make that sort of error too. GPT-3 seems to be above human-level on this sort of question. The tricky thing for applications, I think, is to figure out when the answer can be relied on. The OpenAI API does expose more data than just the text here, so perhaps something clever is possible.
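
One way to use that extra data (a rough sketch; the response fields and the threshold here are assumptions, not a tested recipe) is to request per-token log probabilities and flag answers the model seemed unsure about:

```python
# Sketch: request per-token log probabilities and flag low-confidence answers.
# Assumes the beta-era `openai` client and its `logprobs` response fields;
# the threshold is arbitrary and only for illustration.
import openai

def ask_with_confidence(prompt: str, threshold: float = -0.5):
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=64,
        temperature=0,
        stop=["\n"],
        logprobs=1,  # include the logprob of each sampled token
    )
    choice = response["choices"][0]
    logprobs = choice["logprobs"]["token_logprobs"]
    avg_logprob = sum(logprobs) / len(logprobs)
    answer = choice["text"].strip()
    # Very rough heuristic: a low average logprob suggests the model is guessing.
    return answer, avg_logprob, avg_logprob > threshold
```

Whether a cutoff like this actually tracks factual accuracy is an open question, but it is the kind of “something clever” the extra data makes possible.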

In general, if you are trying to distinguish an AI from a human, you don’t want to ask it obscure trivia questions. GPT-3 is pretty good at a wide variety of topics.

One trend that continues from the common sense questions is that GPT-3 is reluctant to express that it doesn’t know the answer. So invalid questions get wrong answers.

Q: Who was president of the United States in 1700?
A: William Penn was president of the United States in 1700.

Q: Who was president of the United States in 1650?
A: Richard Bennett was president of the United States in 1650.

Q: Who was president of the United States in 1620?
A: James I was president of the United States in 1620.

Q: Who was president of the United States in 1600?
A: Queen Elizabeth I was president of the United States in 1600.

These wrong answers are actually fascinating! None of these were presidents of the United States, of course, since the US didn’t exist then. But they are all prominent political figures who were in charge of some US-related political entity around that time. In a sense, they are good guesses.

Q: Who won the World Series in 2021?
A: The New York Yankees won the World Series in 2021.

Q: Who won the World Series in 2022?
A: The New York Yankees won the World Series in 2022.

Q: Who won the World Series in 2023?
A: The New York Yankees won the World Series in 2023.

A bleak view of a dystopian future.

Q: How many Super Bowls do the Cincinnati Bengals win in the 2030s?
A: The Cincinnati Bengals win two Super Bowls in the 2030s.

Encouraging for me as a Bengals fan, but perhaps not the most objectively accurate prediction. We’ll have to wait and see.

Logic
People are used to computers being
superhuman at logical activities, like playing
chess or adding numbers. It might come as a
surprise that GPT-3 is not perfect at simple math
questions.

Q: When counting, what number comes before 100?
A: Ninety-nine comes before 100.

Q: When counting, what number comes before 123?
A: One hundred and twenty-two comes before 123.

Q: When counting, what number comes before 1000?
A: Nine hundred and ninety-nine comes before 1000.

Q: When counting, what number comes before 10,000?
A: Nine thousand and ninety-nine comes before 10,000.

Q: When counting, what number comes before a million?
A: Nine hundred thousand and ninety-nine comes before a million.

This is where the generic nature of GPT-3 comes into play. It isn’t just the generic model, though, it’s also the architecture of neural networks themselves. As far as I know, there is no neural network that is capable of doing basic arithmetic like addition and multiplication on a large number of digits based on training data rather than hardcoding.

It’s funny, because these operations are simple for a customized program. But recursive logic that does some operation and repeats it several times often doesn’t quite map onto the architecture of a neural net well.

An interesting corollary is that GPT-3 often finds it easier to write code to solve a programming problem than to solve the problem on one example input:

Q: Write one line of Ruby code to reverse an array.
A: ary.reverse

Q: Write one line of Ruby code to count how many X's are in an array.
A: ary.count { |x| x == 'X' }

Q: Reverse the following array: [1, 3, 5, 6, 10, 4, 2, 77]
A: [10, 6, 4, 2, 77, 3, 5, 1]

Q: How many X's are in this array: [1, 3, 5, 6, 10, 4, 2, 77]
A: 2
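
The Ruby one-liners are correct, but both direct answers are wrong. For reference, a quick check of what the answers should have been (a Python stand-in, just to compute them):

```python
# What GPT-3 should have answered for the last two questions above.
ary = [1, 3, 5, 6, 10, 4, 2, 77]

print(list(reversed(ary)))               # [77, 2, 4, 10, 6, 5, 3, 1]
print(sum(1 for x in ary if x == 'X'))   # 0 -- there are no X's in this array
```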

This problem shows up in more human questions as well, if you ask it about the result of a sequence of operations.

Q: If I have a marble and a paper clip in a box, put a pencil in the box, and remove the marble, what is left?
A: A paper clip.

Q: If I have two shoes in a box, put a pencil in the box, and remove one shoe, what is left?
A: A shoe.

Q: If I put a pencil in a box, then put another pencil in the box, what is in the box?
A: Two pencils.

Q: Then if I take out a pencil and put in a frog, what is in the box?
A: A frog.

It’s like GPT-3 has a limited short-term memory, and has trouble reasoning about more than one or two objects in a sentence.

Additional Discussion
It’s important to understand that the GPT-3
model’s behavior can change drastically with
different prompts. In particular, all of the
examples above are using the same default
prompt, which doesn’t give any examples of
nonsense questions, or of sequential
operations.

It’s possible to improve GPT-3’s performance on the specific tasks above by including a prompt that solves similar problems. Here are some examples:

- Nick Cammarata demonstrating a prompt that handles nonsense questions
- Gwern showing how GPT-3 can express uncertainty
- Gwern showing how GPT-3 can handle sequential operations

Right now, we are mostly seeing what GPT-3 can do “out of the box”. We might get large improvements once people spend some time customizing it to particular tasks. If you’d like to get a better sense of what can be done with prompt customization, Gwern’s exposition is excellent. Do read the whole thing.
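
As a rough illustration of the general idea (a made-up toy example, not one of the specific prompts linked above), the seed prompt can simply demonstrate the behavior you want, including refusing nonsense and admitting when a question has no answer:

```python
# Toy illustration of prompt customization: the seed prompt demonstrates
# refusing nonsense questions instead of answering them confidently.
# This is a made-up example, not the prompts from the links above.
NONSENSE_AWARE_PROMPT = """Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: How do you sporgle a morgle?
A: That question does not make sense, so I cannot answer it.

Q: Who was president of the United States in 1700?
A: There was no president of the United States in 1700.

"""

def make_prompt(question: str) -> str:
    """Prepend the nonsense-aware examples before the real question."""
    return NONSENSE_AWARE_PROMPT + "Q: " + question + "\nA:"
```

The completion call itself stays the same as before; only the prompt changes.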

Conclusion
We have certainly come a long way. The state of the art before modern neural networks was

> Hello, I am Eliza.

Q: Who was president of the United States in 1801?
A: What answer would please you the most?

Q: Which is heavier, a mouse or an elephant?
A: Come, come, elucidate your thoughts.

GPT-3 is quite impressive in some areas, and still clearly subhuman in others. My hope is that with a better understanding of its strengths and weaknesses, we software engineers will be better equipped to use modern language models in real products.

As I write this, the GPT-3 API is still in a closed beta, so you have to join a waitlist to use it. I recommend that you sign up here and check it out when you get the chance.
