
Giving GPT-3 a Turing Test

Jul 6, 2020

I’ve been playing around with OpenAI’s new GPT-3 language model. When I got beta access, the first thing I wondered was, how human is GPT-3? How close is it to passing a Turing test?

How It Works
Let me explain how exactly I’m generating
these conversations. GPT-3 is a general
language model, trained on a large amount of
uncategorized text from the internet. It isn’t
specific to a conversational format, and it isn’t
trained to answer any specific type of question.
The only thing it does is, given some text, guess
what text comes next.

So if we want GPT-3 to generate answers to questions, we need to seed it with a “prompt”. I’m using this prompt to initialize all of the Q&A sessions:

Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.

Q: What party did he belong to?
A: He belonged to the Republican Party.

Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.

Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.

This is the default prompt suggested by OpenAI for Q&A, so I’m not cherry-picking it to prove anything. The point of this prompt is just to show GPT-3 that we’re doing questions and answers, not to provide it with information. For the prompt, both questions and answers are provided by a human. For all the others, the answers are generated by GPT-3.
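
For concreteness, here is a minimal sketch of how a completion like the ones below can be generated with the beta-era `openai` Python client. The engine name, sampling parameters, and stop sequence are assumptions for illustration, not the exact setup behind these transcripts.

```python
# Minimal sketch of generating one Q&A completion with the beta-era OpenAI API.
# Assumptions: the pre-1.0 `openai` client, the "davinci" engine, and these
# sampling parameters -- not the exact setup used for the transcripts below.
import openai

openai.api_key = "YOUR_API_KEY"

PROMPT_PREFIX = """Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.

Q: What party did he belong to?
A: He belonged to the Republican Party.

Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.

Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.

"""

def ask(question: str) -> str:
    """Append a question to the fixed prompt and let GPT-3 guess what comes next."""
    response = openai.Completion.create(
        engine="davinci",
        prompt=PROMPT_PREFIX + "Q: " + question + "\nA:",
        max_tokens=64,
        temperature=0,
        stop=["\n"],  # the answer ends at the first newline
    )
    return response["choices"][0]["text"].strip()

print(ask("How many eyes does a giraffe have?"))
```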

Common Sense
Traditionally, artificial intelligence struggles at
“common sense”. But GPT-3 can answer a lot of
common sense questions.

Q: What is your favorite animal?
A: My favorite animal is a dog.

Q: Why?
A: Because dogs are loyal and friendly.

Q: What are two reasons that a dog might be in a bad mood?
A: Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many legs does a frog have?
A: A frog has four legs.

Q: Are there any animals with three legs?
A: No, there are no animals with three legs.

Q: Why don't animals have three legs?
A: Animals don't have three legs because they would fall over.

Ten years ago, if I had this conversation, I would have assumed the entity on the other end was a human. You can no longer take it for granted that an AI does not know the answer to “common sense” questions.

How does GPT-3 know that a giraffe has two eyes? I wish I had some sort of “debug output” to answer that question. I don’t know for sure, but I can only theorize that there must be some web page in its training data that discusses how many eyes a giraffe has. If we want to stump GPT-3 with common sense questions, we need to think of questions about things so mundane, they will not appear on the internet.

Q: Which is heavier, a mouse or an elephant?
A: An elephant is heavier than a mouse.

Q: Which is heavier, a mouse or a toaster?
A: A toaster is heavier than a mouse.

Q: Which is heavier, a toaster or a pencil?
A: A pencil is heavier than a toaster.

Q: Which is heavier, a paper clip or a can of soda?
A: A can of soda is heavier than a paper clip.

Q: Which is heavier, a football player or a car?
A: A car is heavier than a football player.

It’s only 4/5 - a pencil is not heavier than a toaster. We’re closer to stumping GPT-3 here. I think a human would be pretty close to 100% on these questions. It makes sense that these are trickier - there probably isn’t any web page that compares toasters and pencils by weight. It’s only indirectly that humans gain this knowledge.

This gives us a hint for how to stump the AI more consistently. We need to ask questions that no normal human would ever talk about.

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many eyes does my foot have?
A: Your foot has two eyes.

Q: How many eyes does a spider have?
A: A spider has eight eyes.

Q: How many eyes does the sun have?
A: The sun has one eye.

Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.

Now we’re getting into surreal territory. GPT-3 knows how to have a normal conversation. It doesn’t quite know how to say “Wait a moment… your question is nonsense.” It also doesn’t know how to say “I don’t know.”

Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.

Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.

Q: Which colorless green ideas sleep furiously?
A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep furiously.

Q: Do you understand these questions?
A: I understand these questions.

The lesson here is that if you’re a judge in a Turing test, make sure you ask some nonsense questions, and see if the interviewee responds the way a human would.

Trivia Questions
GPT-3 is quite good at answering questions
about obscure things.

Q: Who was president of the United States in 1801?
A: Thomas Jefferson was president of the United States in 1801.

Q: What city is in the northwest corner of Ohio?
A: Toledo is in the northwest corner of Ohio.

Q: What highway do you take to get to North Lake Tahoe from Oakland?
A: Highway 80 is the highway you take to get to North Lake Tahoe from Oakland.

Q: What are Darius's abilities in League of Legends?
A: Darius's abilities in League of Legends are: Apprehend, Decimate, Noxian Guillotine, Crippling Strike, Apprehend, and Hemorrhage.

Q: Who did Capablanca defeat to become world chess champion?
A: Capablanca defeated Lasker to become world chess champion.

Oops, a repeat snuck in with question 4, but a human would make that sort of error too. GPT-3 seems to be above human-level on this sort of question. The tricky thing for applications, I think, is to figure out when the answer can be relied on. The OpenAI API does expose more data than just the text here, so perhaps something clever is possible.
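
One way to use that extra data (a rough sketch; the response fields and the threshold here are assumptions, not a tested recipe) is to request per-token log probabilities and flag answers the model seemed unsure about:

```python
# Sketch: request per-token log probabilities and flag low-confidence answers.
# Assumes the beta-era `openai` client and its `logprobs` response fields;
# the threshold is arbitrary and only for illustration.
import openai

def ask_with_confidence(prompt: str, threshold: float = -0.5):
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=64,
        temperature=0,
        stop=["\n"],
        logprobs=1,  # include the logprob of each sampled token
    )
    choice = response["choices"][0]
    logprobs = choice["logprobs"]["token_logprobs"]
    avg_logprob = sum(logprobs) / len(logprobs)
    answer = choice["text"].strip()
    # Very rough heuristic: a low average logprob suggests the model is guessing.
    return answer, avg_logprob, avg_logprob > threshold
```

Whether a cutoff like this actually tracks factual accuracy is an open question, but it is the kind of “something clever” the extra data makes possible.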

In general, if you are trying to distinguish an AI from a human, you don’t want to ask it obscure trivia questions. GPT-3 is pretty good at a wide variety of topics.

One trend that continues from the common sense questions is that GPT-3 is reluctant to express that it doesn’t know the answer. So invalid questions get wrong answers.

Q: Who was president of the United States in 1700?
A: William Penn was president of the United States in 1700.

Q: Who was president of the United States in 1650?
A: Richard Bennett was president of the United States in 1650.

Q: Who was president of the United States in 1620?
A: James I was president of the United States in 1620.

Q: Who was president of the United States in 1600?
A: Queen Elizabeth I was president of the United States in 1600.

These wrong answers are actually fascinating! None of these were presidents of the United States, of course, since the US didn’t exist then. But they are all prominent political figures who were in charge of some US-related political entity around that time. In a sense, they are good guesses.

Q: Who won the World Series in 2021?
A: The New York Yankees won the World Series in 2021.

Q: Who won the World Series in 2022?
A: The New York Yankees won the World Series in 2022.

Q: Who won the World Series in 2023?
A: The New York Yankees won the World Series in 2023.

A bleak view of a dystopian future.

Q: How many Super Bowls do the Cincinnati Bengals win in the 2030s?
A: The Cincinnati Bengals win two Super Bowls in the 2030s.

Encouraging for me as a Bengals fan, but perhaps not the most objectively accurate prediction. We’ll have to wait and see.

Logic
People are used to computers being
superhuman at logical activities, like playing
chess or adding numbers. It might come as a
surprise that GPT-3 is not perfect at simple math
questions.

Q: When counting, what number comes before 100?
A: Ninety-nine comes before 100.

Q: When counting, what number comes before 123?
A: One hundred and twenty-two comes before 123.

Q: When counting, what number comes before 1000?
A: Nine hundred and ninety-nine comes before 1000.

Q: When counting, what number comes before 10,000?
A: Nine thousand and ninety-nine comes before 10,000.

Q: When counting, what number comes before a million?
A: Nine hundred thousand and ninety-nine comes before a million.

This is where the generic nature of GPT-3 comes into play. It isn’t just the generic model, though, it’s also the architecture of neural networks themselves. As far as I know, there is no neural network that is capable of doing basic arithmetic like addition and multiplication on a large number of digits based on training data rather than hardcoding.

It’s funny, because these operations are simple for a customized program. But recursive logic that does some operation and repeats it several times often doesn’t quite map onto the architecture of a neural net well.

An interesting corollary is that GPT-3 often finds it easier to write code to solve a programming problem than to solve the problem on one example input:

Q: Write one line of Ruby code to reverse an array.
A: ary.reverse

Q: Write one line of Ruby code to count how many X's are in an array.
A: ary.count { |x| x == 'X' }

Q: Reverse the following array: [1, 3, 5, 6, 10, 4, 2, 77]
A: [10, 6, 4, 2, 77, 3, 5, 1]

Q: How many X's are in this array: [1, 3, 5, 6, 10, 4, 2, 77]
A: 2
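
The Ruby one-liners are correct, but both direct answers are wrong. For reference, a quick check of what the answers should have been (a Python stand-in, just to compute them):

```python
# What GPT-3 should have answered for the last two questions above.
ary = [1, 3, 5, 6, 10, 4, 2, 77]

print(list(reversed(ary)))               # [77, 2, 4, 10, 6, 5, 3, 1]
print(sum(1 for x in ary if x == 'X'))   # 0 -- there are no X's in this array
```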

This problem shows up in more human questions as well, if you ask it about the result of a sequence of operations.

Q: If I have a marble and a paper clip in a box, put a pencil in the box, and remove the marble, what is left?
A: A paper clip.

Q: If I have two shoes in a box, put a pencil in the box, and remove one shoe, what is left?
A: A shoe.

Q: If I put a pencil in a box, then put another pencil in the box, what is in the box?
A: Two pencils.

Q: Then if I take out a pencil and put in a frog, what is in the box?
A: A frog.

It’s like GPT-3 has a limited short-term memory, and has trouble reasoning about more than one or two objects in a sentence.

Additional Discussion
It’s important to understand that the GPT-3
model’s behavior can change drastically with
different prompts. In particular, all of the
examples above are using the same default
prompt, which doesn’t give any examples of
nonsense questions, or of sequential
operations.

It’s possible to improve GPT-3’s performance on the specific tasks above by including a prompt that solves similar problems. Here are some examples:

- Nick Cammarata demonstrating a prompt that handles nonsense questions
- Gwern showing how GPT-3 can express uncertainty
- Gwern showing how GPT-3 can handle sequential operations

Right now, we are mostly seeing what GPT-3 can do “out of the box”. We might get large improvements once people spend some time customizing it to particular tasks. If you’d like to get a better sense of what can be done with prompt customization, Gwern’s exposition is excellent. Do read the whole thing.
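
As a rough illustration of the general idea (a made-up toy example, not one of the specific prompts linked above), the seed prompt can simply demonstrate the behavior you want, including refusing nonsense and admitting when a question has no answer:

```python
# Toy illustration of prompt customization: the seed prompt demonstrates
# refusing nonsense questions instead of answering them confidently.
# This is a made-up example, not the prompts from the links above.
NONSENSE_AWARE_PROMPT = """Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: How do you sporgle a morgle?
A: That question does not make sense, so I cannot answer it.

Q: Who was president of the United States in 1700?
A: There was no president of the United States in 1700.

"""

def make_prompt(question: str) -> str:
    """Prepend the nonsense-aware examples before the real question."""
    return NONSENSE_AWARE_PROMPT + "Q: " + question + "\nA:"
```

The completion call itself stays the same as before; only the prompt changes.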

Conclusion
We have certainly come a long way. The state of the art before modern neural networks was

> Hello, I am Eliza.

Q: Who was president of the United States in 1801?
A: What answer would please you the most?

Q: Which is heavier, a mouse or an elephant?
A: Come, come, elucidate your thoughts.

GPT-3 is quite impressive in some areas, and still clearly subhuman in others. My hope is that with a better understanding of its strengths and weaknesses, we software engineers will be better equipped to use modern language models in real products.

As I write this, the GPT-3 API is still in a closed beta, so you have to join a waitlist to use it. I recommend that you sign up here and check it out when you get the chance.
