Abstract
For task two, we conducted experiments on the Rock-Paper-Scissors (RPS) game, covering replicator dynamics, the Q-learning trajectory in the iterated RPS game, fictitious play in the RPS game, and Q-learning with opponent modeling. We find that the agents converge to the Nash equilibrium (which is also Pareto optimal), independent of the initialization.
For task three, we developed and trained a deep Q-learning model to play the Harvest game in a multi-agent setting. Our solution provides an agent capable of appropriating a resource for its own benefit while dealing with the tragedy-of-the-commons dilemma. It consists of a DQN with two hidden layers, a reward function that incentivizes consuming apples and ranking first among the players while penalizing over-depleting an area or shooting, and a compact state representation of the environment.
1 Opponent Modelling
In this part we work on the Rock-Paper-Scissors game. The replicator dynamics of this game and the reward matrix used in all experiments are shown in Fig. 1.
Figure 1: Illustration of replicator dynamics for the unbiased Rock-Paper-Scissors game with the given payoff matrix.
best strategy is to choose a random action. In this situation, the Nash equilibrium is also Pareto optimal. We apply the same Q-learning algorithm in our experiments.
Q: Will our Q-learning agent converge to the Nash equilibrium and Pareto optimality under different initializations? The learning process shown in Fig. 2 depicts that, even when initialized with a biased action probability, the two agents learn the best strategy of playing randomly: the probabilities of the three actions become almost the same. Learning converges to the mixed Nash equilibrium, which is also Pareto optimal, independent of the initial values, see Fig. 3.
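The setup can be sketched as a pair of stateless Q-learners with Boltzmann (softmax) policies; the payoff matrix, temperature, and learning rate below are illustrative assumptions, not the exact values from our experiments:

```python
import numpy as np

# Assumed standard zero-sum RPS payoff for the row player; the matrix in
# Fig. 1 may differ in scale.
PAYOFF = np.array([[ 0, -1,  1],   # rock     vs rock, paper, scissors
                   [ 1,  0, -1],   # paper
                   [-1,  1,  0]])  # scissors

def softmax(q, temp=0.2):
    z = np.exp((q - q.max()) / temp)
    return z / z.sum()

rng = np.random.default_rng(0)
alpha = 0.05
q1 = np.array([1.0, 0.0, 0.0])  # agent 1 biased towards Rock
q2 = np.array([0.0, 1.0, 0.0])  # agent 2 biased towards Paper

for _ in range(20000):
    a1 = rng.choice(3, p=softmax(q1))
    a2 = rng.choice(3, p=softmax(q2))
    r1 = PAYOFF[a1, a2]
    # Stateless Q-update: pull each value estimate towards the observed payoff.
    q1[a1] += alpha * (r1 - q1[a1])
    q2[a2] += alpha * (-r1 - q2[a2])

print(softmax(q1), softmax(q2))  # mixed strategies oscillating around uniform
```

The biased initializations of q1 and q2 mirror the experiments of Fig. 2; the derived action probabilities drift towards the uniform mixed strategy.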
Figure 2: Action probabilities during learning in the iterated RPS game, with biased initialization. (1) Left: Agent 1 initialized with a high probability of playing Rock; (2) Right: Agent 2 initialized with a high probability of playing Paper.
Figure 3: Learning trajectories for independent learners (Q-learning agents) with different initializations, derived from the Q values.
1.2 Fictitious Play
Another simple method to achieve coordination between agents is fictitious play. In fictitious play, the agent keeps a model of the other agents by storing their previous moves and deriving from the counts their action probabilities, which it then uses to choose its own move. For this it assumes that the other agent plays action a′ ∈ {r, p, s} with probability

P(a′) = c(a′) / (c(r) + c(p) + c(s))    (1)

where c(a′) is the number of times the opponent has played a′.
Figure 4: Behaviour of an agent performing fictitious play against (1) a random agent, (2) an agent only playing rock, and (3) an agent with a heavy bias towards playing paper.
Fig. 4 compares an agent applying fictitious play against biased opponents: by updating its beliefs about the probabilities with which the opponent is playing, it plays more favorable moves against that opponent. Note that when playing against an opponent that only chooses one specific action, the optimal solution would be to always play the counter action (i.e. always play paper against an opponent that always plays rock). Our implementation, however, chooses the counter action only with P = 0.66. This is due to how we compute the agent's next move a_self given a reward matrix A, namely as the normalized action probability A · (P(a′_r), P(a′_p), P(a′_s))^T.
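The 0.66 figure can be reproduced with a small sketch; the shift-and-normalize step (`ev - ev.min()`) is our reading of how the product A · P is turned into a probability distribution:

```python
import numpy as np

# Reward matrix A for the row player: rows = our action (r, p, s),
# columns = opponent action (r, p, s).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def fictitious_play_policy(counts):
    """Own mixed strategy from empirical opponent counts: Eq. (1) gives the
    opponent model, then A @ P is shifted and normalized into probabilities."""
    p_opp = counts / counts.sum()      # Eq. (1)
    ev = A @ p_opp                     # expected value of playing r, p, s
    ev = ev - ev.min()                 # shift so all entries are non-negative
    total = ev.sum()
    return np.ones(3) / 3 if total == 0 else ev / total

# Opponent that has only ever played rock:
print(fictitious_play_policy(np.array([10.0, 0.0, 0.0])))
# paper gets probability 2/3 = 0.66 rather than 1.0
```

Against the all-rock opponent the expected values are (0, 1, −1); shifting and normalizing yields (1/3, 2/3, 0), which explains why the counter is chosen only with P = 0.66.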
Claus and Boutilier proposed a method to combine Q-learning with opponent modeling [1]. Instead of using single actions in the Q-learning algorithm, they use the joint action of all agents. Each joint action learner (JAL) maintains an empirical distribution over its opponents' actions. Based on the joint-action Q value and the opponents' action probabilities, they calculate an expected value for each action of every agent:
EV(a_i) = Σ_{a_{−i} ∈ A_{−i}} Q(a_{−i} ∧ a_i) · Π_{j≠i} Pr^j(a_{−i}[j])    (2)
Here a_i is our agent's action and a_{−i} are the opponents' actions; Pr^j is the empirical probability of opponent j's action, and the probabilities of all opponents are multiplied together. In effect, this method evaluates each of the agent's actions against the opponents' action probabilities and the estimated reward of the joint action. The expected value is then used in place of the Q value, with Boltzmann exploration to choose the action.
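For the two-player RPS case, Eq. (2) collapses to a matrix-vector product; the joint Q table and opponent distribution below are hypothetical:

```python
import numpy as np

def expected_values(q_joint, opp_probs):
    """Two-player instance of Eq. (2): EV(a_i) = sum_b Q(a_i, b) * Pr(b)."""
    return q_joint @ opp_probs

def boltzmann(ev, temp=0.1):
    """Boltzmann exploration: soft action distribution over expected values."""
    z = np.exp((ev - ev.max()) / temp)
    return z / z.sum()

# Hypothetical joint-action Q table (rows: our r/p/s, cols: opponent's r/p/s).
q_joint = np.array([[ 0., -1.,  1.],
                    [ 1.,  0., -1.],
                    [-1.,  1.,  0.]])
opp_probs = np.array([0.8, 0.1, 0.1])  # opponent observed to favour rock

ev = expected_values(q_joint, opp_probs)
print(ev, boltzmann(ev))  # paper (index 1) has the highest expected value
```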
Research Question: Will the joint action learner converge to the Nash equilibrium under different initializations?
Figure 5: Action probabilities during learning of the joint action learner in the RPS game, with biased initialization. (1) Left: Agent 1 initialized with a high probability of playing Scissors; (2) Right: Agent 2 initialized with a high probability of playing Rock.
Figure 6: Learning trajectories for joint action learners (Q-learning with opponent modeling) with different initializations.
2.1 Algorithm
2.1.1 DQN Algorithm
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, until it converges to the optimal action value [4]. In Q-learning, a linear function approximator is typically used. We can also use a neural network with weights θ as the function approximator; such a network is called a Q-network. A Q-network can be trained by minimizing the mean squared error between the estimated action value and the target value at each iteration [4].
The target is based on the network weights and the reward. When optimizing the loss function L_i(θ_i) of iteration i, the parameters θ_{i−1} from the previous iteration are held fixed [4]. Usually we optimize the loss function by stochastic gradient descent.
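A minimal sketch of the target and loss computation; the batch values and the discount factor below are made up for illustration:

```python
import numpy as np

gamma = 0.99  # discount factor (assumed)

def td_targets(rewards, next_q_values, dones):
    """y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); the bootstrap
    term is dropped on terminal transitions."""
    return rewards + gamma * next_q_values.max(axis=1) * (1.0 - dones)

# Toy batch of 3 transitions with 4 actions; next_q comes from the frozen
# previous-iteration network theta_{i-1}.
rewards = np.array([1.0, 0.0, -1.0])
dones   = np.array([0.0, 0.0, 1.0])
next_q  = np.array([[0.5, 0.2, 0.1, 0.0],
                    [0.0, 0.3, 0.9, 0.4],
                    [0.7, 0.1, 0.2, 0.3]])

y = td_targets(rewards, next_q, dones)
q_taken = np.array([1.2, 0.8, -0.5])   # Q(s, a; theta_i) for the taken actions
loss = np.mean((y - q_taken) ** 2)     # L_i(theta_i), minimized by SGD
print(y, loss)
```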
This algorithm is model-free: it directly uses samples from the agent, without explicitly constructing a model of the environment. It is also off-policy: at each step it learns from the value of the maximizing action, based directly on the reward. In practice, the action is often selected by an ε-greedy strategy, which selects a random action with probability ε and the greedy action (the one with the maximum Q value) with probability 1 − ε [4].
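The selection rule, with an annealing schedule like the one in our hyperparameter table, can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy
    (maximum-Q) action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Annealing: start fully random and decay towards a floor
# (initial 1.0, minimum 0.01, decay 0.995, as in our hyperparameters).
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995
q = np.array([0.1, 0.5, 0.2])
for _ in range(1000):
    _a = epsilon_greedy(q, epsilon)
    epsilon = max(eps_min, epsilon * eps_decay)
print(epsilon)  # 0.01 (clipped at the minimum)
```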
action the agent takes. In this case, the parameters could get stuck in a poor local minimum, or even diverge catastrophically [4].
The experience replay technique alleviates the problem mentioned above. We store the agent's experience e_t = (s_t, a_t, r_t, s_{t+1}) at each time step in a data set, where s_t is the current state, a_t the action taken, r_t the immediate reward for taking a_t, and s_{t+1} the next state reached by taking action a_t. We then train the Q-network each time with a mini-batch of data randomly sampled from the experience data set. This approach has several advantages [4]: First, each experience step is potentially used in many weight updates, which improves data efficiency. Second, it breaks the correlations between consecutive samples, since the training data distribution is averaged over many previous states. Using experience replay thus smooths learning and avoids oscillations or divergence in the parameters; training is much more stable with this technique.
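A minimal replay buffer along these lines (capacity and field names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of e_t = (s_t, a_t, r_t, s_{t+1}) transitions;
    uniform mini-batch sampling breaks correlations between consecutive steps."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experience evicted first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(200):            # pushing past capacity drops the oldest entries
    buf.push(t, t % 4, 0.0, t + 1)
states, actions, rewards, next_states = buf.sample(batch_size=32)
print(len(buf), len(states))    # 100 32
```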
2.2 Implementation
2.2.1 Feature Vectors
Over the course of training we tried multiple combinations of input features, with sizes varying between 18 and 225 variables (see Table 1). The agent is only informed about a 15 × 15 window with the agent placed in the centre. If an object is placed one tile to the left of the agent, the global coordinates of that object, and of the agent itself, within the 36 × 16 window of the Harvest game do not matter. Therefore we decided to use a local representation of objects: given an agent placed at a = (x_a, y_a) and an object placed at o = (x_o, y_o), the object can be represented relative to the agent as o′′ = (x_a − x′_o, y_a − y′_o). To accommodate the agent being able to look and move through walls, we first had to transform the coordinates of objects across the wall (yielding o′).
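Our reading of this transform, with the wall crossing handled by wrap-around modular arithmetic (the shortest-displacement convention is our assumption):

```python
WIDTH, HEIGHT = 36, 16  # Harvest map size

def relative_position(agent, obj, width=WIDTH, height=HEIGHT):
    """Agent-centred coordinates o'' = (x_a - x'_o, y_a - y'_o), where o' is
    the object's position shifted across the wrap-around walls so that the
    shortest displacement is reported."""
    dx = (agent[0] - obj[0]) % width
    dy = (agent[1] - obj[1]) % height
    if dx > width // 2:
        dx -= width
    if dy > height // 2:
        dy -= height
    return dx, dy

# An object in the opposite corner is actually a diagonal neighbour
# once the walls wrap around:
print(relative_position((0, 0), (35, 15)))  # (1, 1)
```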
Hyperparameter               Value
State size                   21
Batch size                   32
Training steps per episode   100
Initial epsilon              1.0
Min. epsilon                 0.01
Epsilon decay                0.995
Learning rate                0.001
reward.
R^i_t = 0.7 · O^i_t + 0.3 · G^i_t    (5)

O^i_t = (OwnScore^i_{t+1} − OwnScore^i_t) + OwnScore^i_{t+1} / GroupScore_{t+1};  if a_t = fire, O^i_t ← O^i_t − 50    (6)

G^i_t = GroupScore_{t+1} − GroupScore_t;  if Apple^i_{t+1} = 0, G^i_t ← G^i_t − 10    (7)
• Own reward (O^i_t): agent i's own score gained at time step t, plus the proportion of its score in the total score. Since we consider 'fire' a very harmful action for a cooperating group, we give the firing agent an extra punishment by directly subtracting 50 from its own reward.
• Group reward (G^i_t): the score gained by the whole group at time step t. We also punish the extreme situation in which an agent eats up all the apples it can observe, by directly subtracting 10 from the total reward, since this depletes the public resource, which is bad for the whole group.
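Eqs. (5)-(7) transcribe directly into code; the argument names are hypothetical:

```python
def reward(own_prev, own_next, group_prev, group_next, fired, apples_left):
    """Blend of own and group reward, Eqs. (5)-(7), with penalties for
    firing and for depleting every observable apple."""
    # Eq. (6): own score delta plus share of the group total
    own = (own_next - own_prev) + own_next / group_next
    if fired:
        own -= 50
    # Eq. (7): group score delta
    group = group_next - group_prev
    if apples_left == 0:
        group -= 10
    # Eq. (5): weighted combination
    return 0.7 * own + 0.3 * group

# Agent ate one apple (own score 5 -> 6, group total 10 -> 12), no firing:
print(reward(5, 6, 10, 12, fired=False, apples_left=3))  # ~1.65
```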
process for the last model, computed as the median of the MSE over each episode. From this curve, the rapid decrease of the error in the first episodes is evident; the error then continues decreasing for a short time until the curve becomes almost flat. This fast learning is due to the fact that we trained our model 100 times within each episode and kept only the model of the winner, as an evolutionary selection process, in addition to using the reduced input-space representation. Beyond the architecture settings, we found that by penalizing shooting in the reward function, and rewarding eating apples and occupying the first place among the players, the agents learned not to shoot all the time and to look for apples and harvest them without over-depleting the environment.
The learned behaviour of the agents is shown in Figure 8. On the left we can observe the increasing trend of the average final score among the players, per episode. On the right we can observe the curve for the equality metric (introduced by [5]). This last metric was very useful for detecting whether some agents were depleting all the apples greedily, since for the previous models it used to be very unstable around low values. With our last model, however, the agents learned to look for apples and harvest them, while at the same time cooperating to preserve the common-pool resource by leaving some apples in the area so that they can grow again. In this way the overall score of all the agents increases and the equality measure stabilizes around 0.9. This result is therefore comparable to that of [5] for the same metric, where the authors have a game with a very similar setting and agents that had to learn to share common-pool resources in a multi-agent setting.
Time Allocation
Each of us spent 15 to 20 hours on task 2 and around 60 hours on task 3, including the time for reading papers and writing the report.
For task 2, Bingqian worked on the RPS game using Q-learning; Fabian did the fictitious play; Camila and Bingqian worked on the joint action learner (Q-learning with opponent modeling); Fabian and Camila worked on plotting.
For task 3, Bingqian worked on the code framework, data set, and reward function; Fabian worked on the feature representation; Camila worked on the neural network and training scripts; Camila and Fabian worked on evaluation and plotting. All of us worked on model training.
The report was written by all of us.
Figure 8: Average score of all agents (left) and corresponding equality (right).
References
[1] Caroline Claus and Craig Boutilier (1998). The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. In Proceedings of the National Conference on Artificial Intelligence (AAAI 1998).
[2] Inom Mirzaev, Drew F. K. Williamson, and Jacob G. Scott (2018). egtplot: A Python Package for Three-Strategy Evolutionary Games. Journal of Open Source Software, 3, 735. doi:10.21105/joss.00735.
[3] Joel Z. Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel (2017). Multi-agent Reinforcement Learning in Sequential Social Dilemmas.
[4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller (2013). Playing Atari with Deep Reinforcement Learning. CoRR, abs/1312.5602.
[5] Julien Pérolat, Joel Z. Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel (2017). A multi-agent reinforcement learning model of common-pool resource appropriation.