
The Prisoner’s Dilemma

Seminar in Artificial Intelligence


185.054
WS 2012/13
Tamás Schmidt
9201679
November 23, 2012

The aim of this seminar paper is to present the Prisoner’s Dilemma, closely following chapter 11 of Robert Kowalski’s book [Kow11]. After a description of the game, its general form, the introduction of some additional terminology and a short investigation of its nature, Kowalski’s suggested solution is presented, followed by an agent model for the game. The concluding sections briefly address a few social aspects, a many-player version of the game, and some empirical data from the literature. The essay is kept in an informal style, but pointers to formal definitions and additional material in the literature are given.

The game
The Prisoner’s Dilemma is a game that illustrates how players might prefer not to cooperate although cooperation would be mutually more beneficial. It has been extensively investigated by game theorists but has also attracted interest in other disciplines such as philosophy, psychology, economics and the social sciences. The game was designed by Merrill Flood and Melvin Dresher in 1950, while the name Prisoner’s Dilemma and its popularization are attributed to Albert Tucker (see [Kuh09], [HHV04] pp. 172, 180). Game theory labels games of this type, where every player makes a single move, as static games. As the actual order of these moves, or decisions, is not relevant to the outcome, they are also called simultaneous decision games ([Web07] p. 61).
In short, Kowalski narrates the following gameplay story: You and your friend John successfully rob a bank, getting away with 1 million pounds. When the police stop you for a broken headlight, the money is discovered in the course of a routine investigation and you are arrested on suspicion of robbery. The police, lacking witnesses and your confession, can convict you only for the lesser offense of possessing stolen property, with a penalty of one year in jail. You are separated and questioned without the possibility to communicate with each other. If one of you turns witness while the other refuses, the witness will walk free and the other person will be sentenced to six years in jail. If both of you turn witness you will each be sentenced to 3 years in jail. However, if you both refuse to turn witness you will both be sentenced to 1 year in jail. The following table (taken from [Kow11] p. 145) summarizes your options:
Action            State of the world
                  John turns witness        John refuses
I turn witness    I get 3 years in jail     I get 0 years in jail
I refuse          I get 6 years in jail     I get 1 year in jail

The exact story being told as well as the numbers of years in jail vary throughout the literature. But they are all instances of the general form presented in the Stanford Encyclopedia of Philosophy ([Kuh09]):

                  Cooperate      Defect
Cooperate         R, R           S, T
Defect            T, S           P, P

The two players are called Row and Column, both having the possible moves cooperate or defect[1]. The table, referred to as a decision table or payoff matrix[2] in the literature, shows the consequences of the players’ choices as pairs (Row, Column), where the variables are to be read as Reward for mutual cooperation, Punishment for mutual defection, Temptation for defecting alone and Sucker for cooperating alone, and have to satisfy the inequalities T > R > P > S. The values of T, R, P, S are the results of the utility function, which maps the chosen action to a numerical value often referred to as the payoff. Intuitively, spending less time in jail is of higher utility; thus, following Kowalski’s instance of the game, the utility of N years in jail can be defined as −N, resulting in the following utility functions for John and me:

u_I(I cooperate, John cooperates)  = R = −1      u_John(I cooperate, John cooperates)  = R = −1
u_I(I cooperate, John defects)     = S = −6      u_John(I cooperate, John defects)     = T = 0
u_I(I defect, John cooperates)     = T = 0       u_John(I defect, John cooperates)     = S = −6
u_I(I defect, John defects)        = P = −3      u_John(I defect, John defects)        = P = −3

[1] Note that cooperate corresponds to refusing to turn witness, defect corresponds to turning witness, Row corresponds to I and Column corresponds to John.
[2] Such a tabular description of the game is sometimes also called the normal form or strategic form (see [Web07] p. 64).
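To make the payoff structure concrete, here is a minimal sketch in Python (my own illustration, not code from [Kow11]) that encodes Kowalski’s instance of the game and derives the payoffs from the utility u = −N for N years in jail; all identifiers are made up for this example:

```python
# Minimal sketch (not from [Kow11]): encoding Kowalski's instance of the game.
# Actions: "witness" (defect) and "refuse" (cooperate); utility of N years in jail is -N.

YEARS_IN_JAIL = {
    # (my action, John's action) -> (my years, John's years)
    ("witness", "witness"): (3, 3),
    ("witness", "refuse"):  (0, 6),
    ("refuse",  "witness"): (6, 0),
    ("refuse",  "refuse"):  (1, 1),
}

def utility(years: int) -> float:
    """Utility of spending `years` in jail, following the definition u = -N."""
    return -years

# Payoffs in the (Row, Column) convention of the general form.
payoffs = {outcome: (utility(mine), utility(johns))
           for outcome, (mine, johns) in YEARS_IN_JAIL.items()}

if __name__ == "__main__":
    for outcome, (u_me, u_john) in payoffs.items():
        print(outcome, "->", (u_me, u_john))
```

Printing the derived payoffs confirms the required ordering T > R > P > S, namely 0 > −1 > −3 > −6.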
Before proceeding, let’s introduce some more game theory terminology from the literature: A strategy[3] s is said to strictly dominate[4] a strategy s′ if s always produces a better result than s′. “Strategy s weakly dominates s′ if s is better than s′ on at least one strategy profile and no worse on any other.” ([RN03] p. 633). “A dominant strategy is a strategy that dominates all others.” ([RN03] p. 633). Webb defines Pareto optimality as follows: “A solution is said to be Pareto optimal (after the Italian economist Vilfredo Pareto) if no player’s payoff can be increased without decreasing the payoff to another player. Such solutions are also termed socially efficient or just efficient.” ([Web07] pp. 62-63). And according to Russell and Norvig: “An outcome is Pareto dominated by another outcome if all players would prefer the other outcome.” ([RN03] p. 633). A Nash equilibrium[5] for a two-player game is a pair of strategies, one for each player, such that no player can do better by switching his strategy, given that the other player sticks to his strategy. Or, as Russell and Norvig put it: “An equilibrium is essentially a local optimum in the space of policies.”[6] ([RN03] p. 634).

[3] A formal definition of strategy can be found in [Web07] pp. 24-27. In the case of the Prisoner’s Dilemma, with just one move per player, the strategies correspond to the possible actions, that is, turn witness and refuse.
[4] Strict domination is also called strong domination in the literature (e.g. in [RN03]). Webb provides a formal definition of dominance in [Web07] p. 66.
[5] The name was given in honor of the mathematician John Nash (1928-present), who proved that every finite game has an equilibrium of this type (see [RN03] p. 634). For a formal definition see [Web07] pp. 69-71. Nash’s idea is extensively discussed in chapter 2 of [HHV04].
[6] Read policies as strategies.

Returning to Kowalski’s version of the game, the strategy pair (turn witness, turn witness) is the only Nash equilibrium of the game. To see why, consider the possibilities for each player:
Assuming that John turns witness, your options are to turn witness as well, with a utility of −3, or to refuse, with a utility of −6. If, on the other hand, John refuses to turn witness, your options are to turn witness, with a utility of 0, or to refuse as well, with a utility of −1. Therefore, no matter what John does, defection is always the favorable option for you. Looking at the situation from John’s perspective, defection again strictly dominates cooperation. The rational choice[7] is the action with the highest expected utility, that is, to turn witness, thereby producing the sub-optimal result of 3 years in jail for both players, which is Pareto dominated by the result (−1, −1); that is, mutual cooperation would be the (socially) efficient strategy. Checking the Nash equilibrium criterion for this result, it becomes apparent that if you change your strategy to cooperation while John sticks to defection, you get 6 years in jail instead of 3, that is, you are worse off by switching. The analogous argument holds for John; therefore mutual defection is indeed a Nash equilibrium of this game[8]. Russell and Norvig describe this property to the point: “That is the attractive power of an equilibrium point.” ([RN03] p. 634).

[7] Informally speaking, we can call a player rational if he has a preference for the results of his actions such that, given any two actions, he can always compare them (completeness) and such that he can list all actions in a preference ordering (transitivity). For a more detailed discussion and formal definitions of rationality see [Web07] pp. 11-17 and [HHV04] pp. 7-8.
[8] Investigating all other combinations reveals that there is indeed no other Nash equilibrium of this game.
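The argument above can also be checked mechanically. The following sketch (again my own illustration, reusing the payoff values from the table) tests strict dominance and the Nash equilibrium condition by simply enumerating all action pairs:

```python
# Sketch (illustrative, not from the cited texts): brute-force check of dominance
# and Nash equilibria for the two-player game with payoffs (u_me, u_john).

ACTIONS = ("witness", "refuse")  # defect, cooperate

PAYOFFS = {  # (my action, John's action) -> (my utility, John's utility)
    ("witness", "witness"): (-3, -3),
    ("witness", "refuse"):  (0, -6),
    ("refuse",  "witness"): (-6, 0),
    ("refuse",  "refuse"):  (-1, -1),
}

def strictly_dominates(mine_a: str, mine_b: str) -> bool:
    """True if my action `mine_a` yields strictly higher utility than `mine_b`
    against every possible action of John."""
    return all(PAYOFFS[(mine_a, john)][0] > PAYOFFS[(mine_b, john)][0]
               for john in ACTIONS)

def is_nash_equilibrium(mine: str, john: str) -> bool:
    """True if neither player can improve by unilaterally switching actions."""
    u_me, u_john = PAYOFFS[(mine, john)]
    best_me = max(PAYOFFS[(alt, john)][0] for alt in ACTIONS)
    best_john = max(PAYOFFS[(mine, alt)][1] for alt in ACTIONS)
    return u_me >= best_me and u_john >= best_john

if __name__ == "__main__":
    print(strictly_dominates("witness", "refuse"))        # True: defection dominates
    print([(m, j) for m in ACTIONS for j in ACTIONS
           if is_nash_equilibrium(m, j)])                 # [('witness', 'witness')]
```

As expected, the only pair satisfying the equilibrium condition is mutual defection.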

Kowalski’s solution
Kowalski starts with the observation that a player cannot control the actions of others, but can at least estimate a likelihood for their possible choices. He then suggests incorporating this estimate into the computation of the overall expected utility of an action by weighting the utility of each possible outcome of the action by its probability and then summing up the weighted utilities. That is, given the n alternative outcomes of an action with associated utilities u1, u2, ..., un and the respective probabilities p1, p2, ..., pn, the overall expected utility p1u1 + p2u2 + ... + pnun of that action is obtained. Sticking to the definition of the utility of getting N years in jail as −N and assuming the probability of John turning witness is P and of John refusing is (1 − P), he presents the following decision table ([Kow11] p. 150):
Action            State of the world                            Expected utility
                  John turns witness     John refuses with      P × utility1 +
                  with probability P     probability (1 − P)    (1 − P) × utility2
I turn witness    I get 3 years          I get 0 years          −3P
                  with utility1 = −3     with utility2 = 0
I refuse          I get 6 years          I get 1 year           −6P − (1 − P)
                  with utility1 = −6     with utility2 = −1     = −5P − 1
The expected utility −3P is greater than −5P − 1 if P > −1/2, which is always the case as probabilities have values 1 ≥ P ≥ 0. Therefore defection remains the dominant strategy so far. Kowalski argues that the reason for this is that you did not consider the utility of your choice to John in your payoff calculations. If you value the time John serves in jail equally to your own sentence, the payoff table (taken from [Kow11] p. 151) looks like this:
Action            State of the world                            Expected utility
                  John turns witness     John refuses with      P × utility1 +
                  with probability P     probability (1 − P)    (1 − P) × utility2
I turn witness    I get 3 years          I get 0 years          −6P − 6(1 − P)
                  John gets 3 years      John gets 6 years      = −6
                  with utility1 = −6     with utility2 = −6
I refuse          I get 6 years          I get 1 year           −6P − 2(1 − P)
                  John gets 0 years      John gets 1 year       = −4P − 2
                  with utility1 = −6     with utility2 = −2

The expected utilities satisfy the inequality −4P − 2 ≥ −6 for all values of 1 ≥ P ≥ 0, that is, cooperation weakly dominates defection[9]. Assuming that John shares your beliefs, the game changes entirely. Cooperation is now the dominant strategy for both of you, therefore (refuse, refuse) is a Nash equilibrium of the game[10] and, in contrast to previous outcomes, this result is actually Pareto optimal. However, Kowalski points out that it might not be realistic that you value a punishment equally for John and yourself, and he gives an example of a scenario where you value sentences for John only half as severely as for yourself. Your new utility function maps N years in jail for John to −N/2. The corresponding decision table (taken from [Kow11] p. 151) is:
Action            State of the world                            Expected utility
                  John turns witness     John refuses with      P × utility1 +
                  with probability P     probability (1 − P)    (1 − P) × utility2
I turn witness    I get 3 years          I get 0 years          −4.5P − 3(1 − P)
                  John gets 3 years      John gets 6 years      = −1.5P − 3
                  with utility1 = −4.5   with utility2 = −3
I refuse          I get 6 years          I get 1 year           −6P − 1.5(1 − P)
                  John gets 0 years      John gets 1 year       = −4.5P − 1.5
                  with utility1 = −6     with utility2 = −1.5

Examination of the expected utilities of cooperation and defection shows that −4.5P − 1.5 > −1.5P − 3 if P < 0.5; that is, if you believe that the probability of John turning witness is less than 50%, then refusing to turn witness is the rational choice for you.

[9] Cooperation strictly dominates defection for all probabilities 1 > P ≥ 0. In the case of P = 1 the payoffs are equal.
[10] Note that although switching your strategy to defection, while John stuck to cooperation, would reduce your sentence to zero, your new utility function values this result lower (−6) due to the higher sentence John would receive.
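The three decision tables differ only in how much weight John’s sentence is given in your utility function. The following sketch (an illustration of the calculation, not code from [Kow11]) parameterizes this weight as w, so that w = 0, w = 1 and w = 0.5 reproduce the three tables, and shows the P < 0.5 threshold for the last case:

```python
# Sketch (illustrative, not Kowalski's code): expected utilities of my two actions
# when John's jail time is weighted by a factor w in my utility function.
# w = 0, 1 and 0.5 reproduce the three decision tables above.

YEARS = {  # (my action, John's action) -> (my years, John's years)
    ("witness", "witness"): (3, 3),
    ("witness", "refuse"):  (0, 6),
    ("refuse",  "witness"): (6, 0),
    ("refuse",  "refuse"):  (1, 1),
}

def expected_utility(my_action: str, p_witness: float, w: float) -> float:
    """Expected utility of my action if John turns witness with probability
    p_witness, valuing N of my years as -N and N of John's years as -w*N."""
    total = 0.0
    for john_action, p in (("witness", p_witness), ("refuse", 1 - p_witness)):
        mine, johns = YEARS[(my_action, john_action)]
        total += p * (-mine - w * johns)
    return total

if __name__ == "__main__":
    # With w = 0.5, cooperation is rational exactly when P < 0.5:
    for p in (0.25, 0.5, 0.75):
        coop = expected_utility("refuse", p, 0.5)
        defect = expected_utility("witness", p, 0.5)
        print(f"P={p}: refuse={coop:.2f}, witness={defect:.2f}")
```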
The assumptions made so far have always led to a symmetric game structure, as depicted in the general form in the second table. That is not necessarily always the case. John and you might face different punishments (e.g. if one of you already has a criminal record and the other does not). Also, John might estimate different probabilities than you do and consequently come up with different utility values. Such changes would lead to asymmetry in the game structure. Asymmetry is addressed in section 2 of [Kuh09]. Further variants of the Prisoner’s Dilemma are discussed in subsequent sections of [Kuh09].

An agent model of the game


The agent model developed in the course of earlier chapters of the book is shown in fig. 1[11]. Observations made about the world trigger maintenance goals of the agent by forward reasoning. From maintenance goals, achievement goals are derived by forward reasoning. By backward reasoning, candidate actions are derived from the achievement goals in accordance with the agent’s beliefs. The consequences of the candidate actions are then derived by forward reasoning.
For the sake of brevity, let’s leave out some details, such as how an agent acquires, alters or disposes of beliefs. Instead, let’s stick to the one maintenance goal of the agent and its beliefs as presented in the book, followed by the observation that starts the computation (the following goals, beliefs and derivations are taken from [Kow11] pp. 145-146):
Maintenance Goal: if an agent requests me to perform an action,
then I respond to the request to perform the action.
Beliefs: I respond to a request to perform an action if I perform the action.
I respond to a request to perform an action
if I refuse to perform the action.

I get 3 years in jail if I turn witness and john turns witness.


I get 0 years in jail if I turn witness and john refuses to turn witness.
I get 6 years in jail if I refuse to turn witness and john turns witness.
I get 1 year in jail if I refuse to turn witness
and john refuses to turn witness.

[11] The picture is taken from the online draft version of the book (http://www.doc.ic.ac.uk/~rak/papers/newbook.pdf, p. 139). The printed version of the book contains a grayscale version of the image on p. 124.

Figure 1: An agent’s observation-thought-decision-action cycle (taken from [Kow11])

Observation: the police request me to turn witness.

If the police and turn witness in the observation can successfully be matched with an agent and perform an action in the condition of the maintenance goal, its conclusion is derived as an achievement goal:

Achievement Goal: I respond to the request to turn witness.

Proceeding further in the agent’s observation-thought-decision-action cycle, two candidate actions and their respective consequences are derived (a small illustrative sketch of this step follows the listing below):
Action 1: I turn witness.
Action 2: I refuse to turn witness.
Consequences 1: I get 3 years in jail if john turns witness.
I get 0 years in jail if john refuses to turn witness.
Consequences 2: I get 6 years in jail if john turns witness.
I get 1 year in jail if john refuses to turn witness.
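Kowalski’s agents are, of course, specified in a form of computational logic rather than in a programming language, but the forward derivation of these conditional consequences can be illustrated with a small sketch (all names are mine, chosen for this example):

```python
# Illustrative sketch only (Kowalski's agents use computational logic, not Python):
# forward-deriving the conditional consequences of each candidate action from the
# jail-term beliefs, with John's choice left open.

BELIEFS = {  # (my action, John's choice) -> my years in jail
    ("turn witness", "turns witness"): 3,
    ("turn witness", "refuses"):       0,
    ("refuse",       "turns witness"): 6,
    ("refuse",       "refuses"):       1,
}

CANDIDATE_ACTIONS = ["turn witness", "refuse"]

def consequences(action: str) -> list[str]:
    """Consequences of `action`, conditional on each possible choice of John."""
    return [f"I get {years} year(s) in jail if John {john_choice}."
            for (my_action, john_choice), years in BELIEFS.items()
            if my_action == action]

if __name__ == "__main__":
    for action in CANDIDATE_ACTIONS:
        print(f"Action: I {action}.")
        for c in consequences(action):
            print("  ", c)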

Your lack of knowledge about John’s decision at this point reflects the situation described in the introductory story. And if you make your decision based on the consequences you just derived, you end up with the well-known sub-optimal result. The solution presented in the previous section can be realized by adding beliefs describing the consequences for John, your estimates of John’s choice and your valuation of the punishment:

John gets 3 years in jail if I turn witness and john turns witness.
John gets 6 years in jail if I turn witness and john refuses to turn witness.
John gets 0 years in jail if I refuse to turn witness and john turns witness.
John gets 1 year in jail if I refuse to turn witness
and john refuses to turn witness.

John turns witness with probability P.
John refuses to turn witness with probability (1 − P).
The utility is −N if I get N years in jail.
The utility is −N/2 if John gets N years in jail.

If the agent sticks with the general approach of the solution presented above[12], then at this point it does not make a fundamental difference whether additional rules are incorporated or whether a pre-assembled solution, such as a library function, is used to calculate the formula. That is the case because all variable subterms of the formula are already represented as beliefs and can therefore be reevaluated by the agent.

[12] That is, using the formula p1u1 + p2u2 + ... + pnun to calculate the overall expected utility of every candidate action.
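As a purely illustrative sketch of such a pre-assembled calculation (the function names and the value P = 0.4 are assumptions made for this example), the expected-utility formula and the resulting choice could look like this:

```python
# Sketch of a generic expected-utility "library function" (illustrative only):
# overall expected utility = p1*u1 + p2*u2 + ... + pn*un over an action's outcomes.

from typing import Sequence, Tuple

def overall_expected_utility(outcomes: Sequence[Tuple[float, float]]) -> float:
    """`outcomes` is a sequence of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

def choose(actions: dict) -> str:
    """Pick the candidate action with the highest overall expected utility."""
    return max(actions, key=lambda a: overall_expected_utility(actions[a]))

if __name__ == "__main__":
    P = 0.4  # assumed probability that John turns witness
    # Utilities combine my sentence (-N) and John's sentence (-N/2), as above.
    candidates = {
        "turn witness": [(P, -4.5), (1 - P, -3.0)],
        "refuse":       [(P, -6.0), (1 - P, -1.5)],
    }
    print(choose(candidates))   # "refuse", since P < 0.5
```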
Kowalski argues that such calculations are a normative ideal and that in practice they are often approximated by heuristics. The exemplary goals he gives in [Kow11] p. 152 prevent an agent from performing an action if it would harm another person. In the case of the Prisoner’s Dilemma this intuitively seems to correspond to absolute (or naive) trust or to mere altruism, and it does, without further refinement, effectively prevent defection. But of course further refinement is possible. Kowalski concludes the chapter by making an argument for smart choices (e.g. identifying higher-level goals) as an alternative to the computational methods used above ([Kow11] pp. 152-153).

Social aspects and the free rider problem


Hargreaves-Heap and Varoufakis point out that the paradoxical quality
of the Prisoner’s Dilemma explains its fascination only in part: “But the
major reason for the interest is purely practical. Outcomes in social life
are often less than we might hope and the Prisoner’s Dilemma provides
one possible key to their understanding.” ([HHV04] p. 172). They also argue that communication is not sufficient to overcome the problem, as the players do not necessarily need to stick to the agreement, and thus mutual defection remains the dominant strategy. This observation is often used as an argument for the necessity of an enforcement agency such as the government ([HHV04] p. 174). Moreover, they point out that this trust issue arises in every imperfectly synchronized economic exchange, such as purchases over the internet, as well as in settings with imperfect information, like the acquisition of second-hand goods, where the buyer discovers the true quality of the purchased good only over a period of time after the transaction. They show how the game can be transformed into the free rider problem, which essentially extends the illustration of the conflict between individual and collective rationality to many players. This extension is then used to address group issues such as public goods, disarmament and corruption ([HHV04] pp. 176-180). Kuhn discusses various approaches to this extension and their adequacy in section 4 of [Kuh09].
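For illustration, here is one standard N-player public-goods formulation of the free rider problem (a common textbook variant; not necessarily the exact construction used in [HHV04]), in which free riding is individually dominant although universal contribution would leave everyone better off:

```python
# Illustrative N-player public-goods sketch (one common formulation of the free
# rider problem; not necessarily the exact variant in [HHV04]).

def payoff(contributes: bool, num_contributors: int, n: int,
           endowment: float = 1.0, multiplier: float = 1.6) -> float:
    """Payoff of one player: keep your endowment unless you contribute, plus an
    equal share of the multiplied pool of all contributions (1 < multiplier < n)."""
    kept = 0.0 if contributes else endowment
    pool = num_contributors * endowment * multiplier
    return kept + pool / n

if __name__ == "__main__":
    n = 10
    # Everyone contributes vs. everyone free-rides:
    print(payoff(True, n, n))       # 1.6  (socially efficient)
    print(payoff(False, 0, n))      # 1.0
    # But unilaterally free-riding while the other 9 contribute pays more:
    print(payoff(False, n - 1, n))  # 1.0 + 9*1.6/10 = 2.44
```

With ten players and a multiplier of 1.6, every contributor receives 1.6, while a lone free rider among nine contributors receives 2.44; since the private share of one’s own contribution (multiplier/n = 0.16) is below 1, free riding pays regardless of what the others do.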
Chapter 5.3 of [HHV04] (pp. 180-185) extensively reviews experimental results on the Prisoner’s Dilemma and the free rider problem, revealing that people do not always stick to the logic suggested by the game. I would like to conclude this essay with a few observations they report that I find notable: In 100 plays under the surveillance of Flood and Dresher, mutual defection occurred only 14 times, whereas 60 times the result was mutual cooperation. In subsequent experiments the cooperation rate was between 30% and 70%. Comparable numbers are reported from experiments with the many-player version. Moreover, the authors present a particular experiment where the following findings were statistically significant:

(1) the probability of men defecting (i.e. playing R1 or C1) was 24 per cent higher than women;
(2) the probability of defecting was 33 per cent lower in the groups where promises were allowed in the pre-play discussion;
(3) the probability of an economics major defecting was 17 per cent higher than non-economics majors;
(4) when promise-making was not possible, economists defected 72 per cent of the time, compared with 47 per cent for non-economists, whereas in the sessions in which promises were allowed economists defected only 29 per cent and non-economists 26 per cent of the time [So it seems that the difference in (3) is wholly attributable to the play in the groups where pre-play discussion is constrained and promise-making not possible.];
(5) the probability of defection fell as students progressed through university (a third year student was 13 per cent less likely to defect than a first year one). (from [HHV04] p. 181)

The authors also report that results (3) and (4) have been replicated in other free rider experiments. After presenting and discussing further empirical material, the authors conclude the chapter by covering the topic of cooperation at length ([HHV04] pp. 185-209).

References
[HHV04] Shaun P. Hargreaves-Heap and Yanis Varoufakis. Game Theory - A Critical Introduction (2nd Edition). Routledge, London and New York, 2004.

[Kow11] Robert Kowalski. Computational Logic and Human Thinking. Cambridge University Press, 2011.

[Kuh09] Steven Kuhn. Prisoner’s Dilemma. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Spring 2009 edition, 2009.

[RN03] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (2nd Edition). Prentice Hall, 2003.

[Web07] James N. Webb. Game Theory - Decisions, Interactions and Evolution. Springer, London, 2007.
