
Report 2

Bingqian Yi (r0726769), Fabian Fingerhut (r0736509), Maria Camila Alvarez T. (r0731521)

13th May 2019

Abstract
For task two, we conducted experiments on the Rock-Paper-Scissors game covering the replicator dynamics, the learning trajectory of Q-learning in the iterated RPS game, fictitious play, and Q-learning with opponent modelling. We find that the agents converge to the Nash equilibrium (which is also Pareto optimal), independent of the initialization.
For task three, we developed and trained a deep Q-learning model to play the Harvest game in a multi-agent setting. Our solution provides an agent capable of appropriating a resource for its own benefit while dealing with the tragedy-of-the-commons dilemma. It consists of a DQN with two hidden layers, a reward function that rewards the consumption of apples and a high rank among the players while penalizing over-depleting an area or shooting, and a compact state representation of the environment.

1 Opponent Modelling
In this part we work on the Rock-Paper-Scissors game. The replicator dynamics of this game and the reward matrix used in all experiments are shown in Fig. 1.

Figure 1: Illustration of the replicator dynamics for the unbiased Rock-Paper-Scissors game with the given payoff matrix (plotted using egtplot for Python [2]).

1.1 Iterated Rock Paper Scissor game with Q learning


RPS is stateless: when using Q-learning there is only one state, and the Q-values of the three actions are accumulated. Theory and algorithm are the same as in Matching Pennies, which we investigated in the first task. The best strategy is to play a random action; in this situation the Nash equilibrium is also Pareto optimal. We apply the same Q-learning algorithm in our experiments.
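The sketch below illustrates this setup. It is a minimal, illustrative version that assumes Boltzmann (softmax) action selection derived from the Q-values and the payoff matrix of Fig. 1; the hyper-parameters (alpha, tau) are placeholders and may differ from our actual experiment code.

import numpy as np

# Payoff of the row player in RPS (rock, paper, scissors); zero-sum game.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def boltzmann(q, tau=0.5):
    """Softmax action probabilities derived from the Q-values."""
    prefs = np.exp((q - q.max()) / tau)   # subtract max for numerical stability
    return prefs / prefs.sum()

def step(q1, q2, alpha=0.05, tau=0.5):
    """One iteration of self-play: both agents act, observe the reward and update."""
    a1 = np.random.choice(3, p=boltzmann(q1, tau))
    a2 = np.random.choice(3, p=boltzmann(q2, tau))
    r1 = PAYOFF[a1, a2]                   # the opponent receives -r1
    # Stateless Q-learning update: Q(a) <- Q(a) + alpha * (r - Q(a)); with a single
    # state the bootstrap term drops out when gamma = 0.
    q1[a1] += alpha * (r1 - q1[a1])
    q2[a2] += alpha * (-r1 - q2[a2])
    return q1, q2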

Q: Does our Q-learning agent converge to the Nash equilibrium (and Pareto optimum) under different initializations? The learning process in Fig. 2 shows that, even when initialized with biased action probabilities, the two agents learn the best strategy of playing randomly: the probabilities of the three actions become almost the same. The learners converge to the mixed Nash equilibrium, which is also Pareto optimal, independent of the initial values, see Fig. 3.

Figure 2: Action probabilities during learning in the iterated RPS game with biased initialization. Left: agent 1 initialized with a high probability of playing Rock; right: agent 2 initialized with a high probability of playing Paper.

Figure 3: Learning trajectories of the independent learners (Q-learning agents) for different initializations, derived from the Q-values (trajectory plots produced with the ternary package for Python [2]).

1.2 Fictitious Play
Another simple method to achieve coordination between agents is fictitious play. In fictitious play, the agent keeps a model of the other agents by storing their previous moves and deriving from them the probabilities used to choose its own move. For this it assumes that the other agent plays its action a' ∈ {r, p, s} with probability

P(a') = c(a') / (c(r') + c(p') + c(s'))    (1)

where c(a') counts how often the opponent has played action a' so far.

Figure 4: Behaviour of an agent performing fictitious play against (1) a random agent, (2) an agent that only plays rock, and (3) an agent with a heavy bias towards playing paper.

Fig. 4 compares an agent applying fictitious play against biased opponents: by updating its beliefs about the probabilities with which the opponent plays, it chooses more favourable moves. Note that when playing against an opponent that only chooses one specific action, the optimal solution would be to always play the counter-action (i.e. always play paper against an opponent that always plays rock). Our implementation, however, chooses the counter-action only with P ≈ 0.66. This is due to the way we compute the agent's next move a_self: given the reward matrix A, we use the normalized action preferences A · (P(a'_r), P(a'_p), P(a'_s))^T as a mixed strategy.
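The following sketch illustrates such a fictitious-play agent. The uniform initial counts and the normalisation (shifting the preferences by their minimum before renormalising) are illustrative assumptions, not necessarily our exact implementation; this particular normalisation also yields the P ≈ 0.66 behaviour against an all-rock opponent described above.

import numpy as np

# Row player's payoff matrix A for RPS.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

class FictitiousPlayer:
    def __init__(self):
        self.counts = np.ones(3)          # opponent move counts (r, p, s)

    def act(self):
        belief = self.counts / self.counts.sum()   # Eq. (1)
        prefs = A @ belief                          # expected payoff of each own action
        prefs = prefs - prefs.min()                 # shift so preferences are non-negative
        if prefs.sum() == 0:                        # uniform belief: play uniformly
            probs = np.full(3, 1 / 3)
        else:
            probs = prefs / prefs.sum()             # normalised mixed strategy
        return int(np.random.choice(3, p=probs))

    def observe(self, opponent_action):
        self.counts[opponent_action] += 1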

1.3 Combining Q-Learning with Opponent Modelling


In a multiagent system it is useful to be aware of the other agents' actions. Modelling the opponents' action probabilities or policies helps the agent choose its own actions and thus perform better. There are two forms of multiagent RL [1]: Independent Learners (IL) ignore the existence of other agents and treat them as part of the environment, so their influence is only reflected in the reward; Joint Action Learners (JAL) learn values or policies over joint actions, i.e. the agent's own action in conjunction with the other agents' actions.
Claus and Boutilier proposed a method to combine Q-learning with opponent modelling [1]. Instead of using single actions in the Q-learning algorithm, they use the joint action of all agents. Each JAL agent maintains an empirical distribution over the opponents' actions. Based on the joint-action Q-values and the probabilities of the opponents' actions, an expected value of each action is computed for every agent:
EV(a_i) = Σ_{a_{−i} ∈ A_{−i}} Q(a_{−i} ∪ {a_i}) ∏_{j≠i} Pr_j[a_{−i}[j]]    (2)

Here a_i is the agent's own action and a_{−i} is a joint action of all opponents; Pr_j[a_{−i}[j]] is the empirical probability that opponent j plays its part of a_{−i}, and these probabilities are multiplied together. In essence, the method scores each of the agent's actions by weighting the estimated reward of every joint action with the probability of the opponents' actions. The expected value then takes the place of the Q-value in Boltzmann exploration when choosing an action.
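As an illustration, in the two-player case the expected values of Eq. (2) reduce to a matrix-vector product over the joint-action Q-table. The sketch below (with illustrative names) shows this together with Boltzmann action selection.

import numpy as np

def expected_values(q_joint, opp_counts):
    """Eq. (2) for two players: q_joint[i, j] = Q(own action i, opponent action j)."""
    opp_probs = opp_counts / opp_counts.sum()   # empirical opponent distribution
    return q_joint @ opp_probs                  # EV(a_i) = sum_j Q(a_i, j) * Pr(j)

def boltzmann_action(ev, tau=0.2):
    """Use the expected values in place of Q-values for Boltzmann exploration."""
    prefs = np.exp((ev - ev.max()) / tau)
    prefs /= prefs.sum()
    return int(np.random.choice(len(ev), p=prefs))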
Research Question:

• What does the learning trajectory of a joint action learner look like?


Answer: the learning process of the two joint action learners is shown in Fig. 5. They are more sensitive to the opponents' actions: as shown in the diagram, agent 1's probability of playing Paper increases rapidly while its probability of playing Scissors drops sharply, because agent 2 prefers to play Rock. Towards the end of the game, however, the probabilities of the three actions become almost equal. The learning trajectories of the joint action learners in Fig. 6 show that they converge to the mixed Nash equilibrium, which is also Pareto optimal, independent of the initialization.

Figure 5: Action probabilities during learning of the joint action learners in the RPS game, with biased initialization. Left: agent 1 initialized with a high probability of playing Scissors; right: agent 2 initialized with a high probability of playing Rock.

2 Harvest Game Agent


The Prisoner's Dilemma, Matching Pennies and Rock-Paper-Scissors are all stateless games, so it is easy to keep a Q-value for every state and action. In more complex situations, such as the Harvest game, it is impossible to enumerate all states. Here the deep Q-learning algorithm shows its strength: instead of storing states in a Q-table, it uses a neural network to predict the Q-value of each state directly.

Figure 6: Learning trajectories of the joint action learners (Q-learning with opponent modelling) for different initializations.

2.1 Algorithm
2.1.1 DQN Algorithm
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update until it converges to the optimal action value [4]. In Q-learning a linear function approximator is typically used; we can instead use a neural network with weights θ as the function approximator, which is then called a Q-network. A Q-network can be trained by minimizing the mean squared error between the estimated action value and a target value at each iteration [4]:

L_i(θ_i) = E_{s,a∼p(·)}[(y_i − Q(s, a; θ_i))^2]    (3)

where p(·) is the distribution over states s and actions a, and y_i is the target of iteration i:

y_i = E_{s'}[r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a]    (4)

The target is based on the network weights and the reward. When optimizing the loss function L_i(θ_i) of iteration i, the parameters θ_{i−1} from the previous iteration are kept fixed [4]. Usually the loss function is optimized by stochastic gradient descent.
The algorithm is model-free: it directly uses samples from the agents without explicitly constructing a model of the environment. It is also off-policy: it learns from the maximizing action value at each step while following a different behaviour policy. In practice, the action is often selected by an ε-greedy strategy: with probability ε a random action is chosen, and with probability 1−ε the greedy action (the one with the maximum Q-value) is taken [4].
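A schematic version of the target of Eq. (4) and of the ε-greedy selection is shown below. The `predict(state) -> Q-values` interface, the frozen copy `old_model` standing in for the fixed parameters θ_{i−1}, and the discount factor are illustrative assumptions, not our exact code.

import numpy as np

def td_target(reward, next_state, done, old_model, gamma=0.95):
    """y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); just r for terminal states."""
    if done:
        return reward
    return reward + gamma * float(np.max(old_model.predict(next_state)))

def epsilon_greedy_action(model, state, epsilon, n_actions):
    """With probability epsilon explore randomly, otherwise take the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(model.predict(state)))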

2.1.2 Experience Replay


The states visited by an agent are obviously sequential, so there is a strong correlation between consecutive samples. When learning directly from consecutive samples, the current parameters determine the next data sample the parameters are trained on [4], which means the training distribution shifts with the actions the agent takes. In this case the parameters can get stuck in a poor local minimum or even diverge catastrophically [4].
The technique of experience replay alleviates this problem. We store the agent's experience e_t = (s_t, a_t, r_t, s_{t+1}) at each time step in a data set, where s_t is the current state, a_t the action taken, r_t the immediate reward obtained by taking a_t, and s_{t+1} the next state. We then train the Q-network on mini-batches of data randomly sampled from this experience data set. The approach has several advantages [4]: first, each experience step is potentially used in many weight updates, which improves data efficiency; second, it breaks the correlation between consecutive samples, since the training distribution is averaged over many previous states. Using experience replay therefore smooths learning and avoids oscillations or divergence of the parameters, making our training process much more stable.
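A minimal replay buffer along these lines could look as follows; the capacity and field names are illustrative assumptions.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped first

    def store(self, state, action, reward, next_state, done):
        # e_t = (s_t, a_t, r_t, s_{t+1}) plus a terminal flag
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # A random mini-batch breaks the correlation between consecutive samples.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)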

2.2 Implementation
2.2.1 Feature Vectors
Over the course of training we tried multiple combinations of input features, with sizes varying between 18 and 225 variables (see Table 1). The agent is only informed about a 15×15-sized window with the agent placed in the centre. If an object is placed one tile to the left of the agent, the global coordinates of that object and of the agent within the 36×16-sized map of the Harvest game do not matter. Therefore we decided to use a local representation of objects: given an agent placed at a = (x_a, y_a) and an object placed at o = (x_o, y_o), the object can be represented relative to the agent as o'' = (x_a − x'_o, y_a − y'_o). To accommodate the agent being able to look and walk through walls, we first had to transform the coordinates of objects across the wall (yielding o').

ID   Description                                              Feature size     In the final model
A    relative position of apples and enemies towards agent    15^2 − 1 = 224   no
B1   number of apples in an x·x sized area (pooling)          15^2 / x^2       yes (x = 5)
B2   number of enemies in an x·x sized area (pooling)         15^2 / x^2       no
C    apples on the positions immediately around the agent     8                yes
D1   orientation of agent (single value encoding)             1                no
D2   orientation of agent (one-hot encoding)                  4                yes
E    other agent in shooting sight                            1                no

Table 1: Tested feature vectors


All input features are related to spatial information: the position of enemies (A, B2, E), the position of apples (A, B1, C) and the orientation of the agent (D1, D2).
Creating a grid representing all tiles around the agent (A) is closest to the representations applied to similar games, which observe raw RGB values [3, 5]. To achieve a faster learning process, however, we decided to pool over these values (B1). The final model includes neither a feature for other agents being in shooting sight (E) nor any spatial information about them (A, B2).
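The sketch below shows how the selected features (B1, C, D2) could be assembled into the 21-dimensional state vector of Table 2. It assumes a binary 15×15 apple map centred on the agent and is an approximation of our preprocessing, not the exact code.

import numpy as np

def pooled_apple_counts(window, x=5):
    """Feature B1: sum the apples inside each x-by-x block of the 15x15 window (9 values)."""
    h, w = window.shape
    return np.array([window[i:i + x, j:j + x].sum()
                     for i in range(0, h, x) for j in range(0, w, x)])

def neighbourhood(window):
    """Feature C: apples on the 8 tiles immediately around the agent in the window centre."""
    c = window.shape[0] // 2
    patch = window[c - 1:c + 2, c - 1:c + 2].flatten()
    return np.delete(patch, 4)      # drop the agent's own tile

def one_hot_orientation(orientation):
    """Feature D2: the agent's orientation as a one-hot vector (4 values)."""
    vec = np.zeros(4)
    vec[orientation] = 1.0
    return vec

def state_vector(window, orientation):
    """Concatenate B1 + C + D2 into the 21-dimensional input of the Q-network."""
    return np.concatenate([pooled_apple_counts(window), neighbourhood(window),
                           one_hot_orientation(orientation)])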

2.2.2 Reward Functions


In this game the dilemma is between the short-term interest of each individual and the long-term interest of the whole group. In other words, an agent wants to maximize its own score to win the game, but the group should also reach a relatively high total score. Our target is to make the agents collaborate.
Due to this consideration, we divide the reward into two parts: the agent's own reward and the group reward. We then combine them with different proportions, 0.7 for the own reward and 0.3 for the group reward:

R_t^i = 0.7 · O_t^i + 0.3 · G_t^i    (5)

O_t^i = (OwnScore_{t+1}^i − OwnScore_t^i) + OwnScore_{t+1}^i / GroupScore_{t+1},  and O_t^i ← O_t^i − 50 if a_t = fire    (6)

G_t^i = GroupScore_{t+1} − GroupScore_t,  and G_t^i ← G_t^i − 10 if Apple_{t+1}^i = 0    (7)

• Own reward (O_t^i): the score agent i gains at time step t plus the proportion of its score in the total score. Since we consider 'fire' a very harmful action for a cooperating group, we give a firing agent an extra punishment by subtracting 50 from its own reward.

• Group reward (G_t^i): the score the whole group gains at time step t. We also punish the extreme situation in which an agent eats up all apples it can observe, subtracting 10 from the reward, since depleting the public resource is bad for the whole group.

The hyper-parameters and settings of our model are listed in Table 2.

Hyper-parameter                  Value
State size                       21
Batch size                       32
Training iterations per episode  100
Initial epsilon                  1.0
Min. epsilon                     0.01
Epsilon decay                    0.995
Learning rate                    0.001

Table 2: Hyper-parameters and settings of our model
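For illustration, Eqs. (5)-(7) translate into a reward computation along the following lines; the variable names are ours and the guard against a zero group score is an added assumption.

def combined_reward(own_prev, own_next, group_prev, group_next,
                    fired, apples_left_in_view):
    # Own reward: score gained this step plus the agent's share of the group score (Eq. 6).
    own = (own_next - own_prev) + (own_next / group_next if group_next > 0 else 0.0)
    if fired:
        own -= 50   # extra punishment for shooting another agent
    # Group reward: score gained by the whole group this step (Eq. 7).
    group = group_next - group_prev
    if apples_left_in_view == 0:
        group -= 10  # punishment for depleting every observable apple
    # Weighted combination (Eq. 5).
    return 0.7 * own + 0.3 * group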

2.2.3 Neural Network Structure


Since we manually extract features as input, our reduced feature vector is not large, and a simple neural network with two hidden layers works well for us. Each hidden layer has 64 neurons with ReLU activation. The output is the predicted Q-value for each action, and the loss function is the mean squared error.
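A Keras formulation of this network could look as follows; the framework and the number of output actions (n_actions) are assumptions for illustration, while the layer sizes, activation, loss and learning rate follow the description above and Table 2.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_q_network(state_size=21, n_actions=8, learning_rate=0.001):
    # Two hidden layers of 64 ReLU units, linear output: one Q-value per action.
    model = Sequential([
        Dense(64, activation="relu", input_shape=(state_size,)),
        Dense(64, activation="relu"),
        Dense(n_actions, activation="linear"),
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=learning_rate))
    return model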

2.3 Experiment and Results


We ran different experiments varying the state representation, the reward function and hyper-parameters such as the number of units per hidden layer. We used between 6 and 10 agents in our experiments, running each agent on a different port and using the same model architecture (but different random weight initializations and states). We used the default number of initial apples (5). Additionally, we varied how many times we trained our model within each episode.
We started our experiments using as baseline the architecture of [5], which has two hidden layers with 32 units. However, we noticed that for the models with larger state representations (A and D1 from Table 1) this network was not able to learn anything within the first tens of episodes: the agents performed one single action independent of the state they were in. Moreover, the initial models were very aggressive, shooting at other agents all the time. We therefore increased the number of neurons per layer, because the dimensionality of the input space was too large to be learnt by a model with the limited capacity of the baseline. With this change the agents started to learn to move towards apples. The learning process was still rather slow, so we decided to design a more compact input with more valuable information (B1 + C + D2), ending up with the model described in the previous section. Figure 7 presents the learning process of this final model as the median of the MSE over each episode. The curve shows that the error decreases rapidly in the first episodes and then keeps reducing for a short while until it becomes almost flat. This fast learning is due to the fact that we trained our model 100 times within each episode and stored only the model of the winner, as an evolutionary selection process, in addition to using the reduced input representation. Besides the architecture settings, we observed that by penalizing shooting in the reward function and rewarding eating apples and occupying the first place among the players, the agents learned not to shoot all the time and to look for apples and harvest them without over-depleting the environment.

Figure 7: Loss during training.

The learned behaviour of the agents is shown in Figure 8. On the left we observe the increasing trend of the average final score among the players, per episode. On the right we show the curve of the equality metric (introduced by [5]). This metric was very useful for detecting whether some agents deplete all the apples greedily, since for the earlier models it used to be very unstable around low values. With our final model the agents learned to look for apples and harvest them, but at the same time to cooperate to preserve the common-pool resource by leaving some apples in an area so that they can grow again. In this way the overall score of all agents increases and the equality measure stabilizes around 0.9. This result is comparable to the one reported in [5] for the same metric, obtained in a game with a very similar setting in which agents also had to learn to share common-pool resources in a multi-agent setting.

Time Allocation
Each of us spent 15 to 20 hours on task 2 and around 60 hours on task 3, including the time for reading papers and writing the report.
For task 2, Bingqian worked on the RPS game using Q-learning; Fabian implemented fictitious play; Camila and Bingqian worked on the joint action learner (Q-learning with opponent modelling); Fabian and Camila worked on plotting.
For task 3, Bingqian worked on the code framework, the data set and the reward function; Fabian worked on the feature representation; Camila worked on the neural network and training scripts; Camila and Fabian worked on evaluation and plotting. All of us worked on model training.
The report was written by all of us.

Figure 8: Average score of all agents (left) and corresponding equality (right).

References
[1] Caroline Claus and Craig Boutilier. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI 1998), 1998.

[2] Inom Mirzaev, Drew FK Williamson, and Jacob G Scott. egtplot: A Python Package for Three-Strategy Evolutionary Games. Journal of Open Source Software, 3:735, 2018. doi:10.21105/joss.00735.

[3] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent Reinforcement Learning in Sequential Social Dilemmas. 2017.

[4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning. CoRR, abs/1312.5602, 2013.

[5] Julien Pérolat, Joel Z. Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. A Multi-agent Reinforcement Learning Model of Common-Pool Resource Appropriation. 2017.
