
AA203 Final Report

Paula Stocco, Danny Pugh, and Anish Mokkarala

{stoccop, drpugh, mchanish}@stanford.edu

July 17, 2023

Abstract
Obstacle avoidance is a problem of high importance within control theory. Flappy Bird has been used
in the literature to test both model-free and model-aware controllers on novel mixed-integer
dynamics. This project compares the advantages and disadvantages of using Model Predictive
Control and Deep Q-Learning to solve Flappy Bird, and discusses results for generalized cases.
Link to video presentation: https://youtu.be/7ja1yGE40jc

1 Introduction
Progress in the fields of autonomous vehicles and robotics has brought complex control problems
that are highly relevant to solve. Games have been used as controllable environments for the
development and testing of new controllers, ranging from methods that rely on accurate system
dynamics to model-free learned control. Flappy Bird consists of a bird flying and gliding at a
constant forward velocity that flaps under the player's control. Points are scored as the bird
passes pairs of vertically hanging pipes.
Flappy Bird has been used with reinforcement learning techniques to achieve scores over 1600
[1]. As noted in [2], reinforcement learning methods do not require knowing the game dynamics, but
since this is a simulated environment the model is perfectly known. The simulation therefore provides
an opportunity to compare learning and optimal control methods.
Section 2 reviews previous work with Flappy Bird. Section 3 provides an overview of the
problem, sections 4.1 and 4.2 discuss the MPC and RL implementations, respectively, and
section 5 reviews the results and discusses their interpretation.

2 Related Work
Games with simple dynamics and goals have been used to compare different controllers. This
includes comparisons of manually tuned heuristic control, optimization-based control, MPC, and
learned control [1, 2, 3].

2.1 Heuristic Controller


[2] used a hard-coded controller based on the strategy used to play the game manually. The
controller follows the general policy that if the bird is below the top of the lower pipe, plus or
minus some offset, the control is to flap, and otherwise not to flap. The heuristic control is tuned
through trial and error.
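A hypothetical sketch of this policy is shown below; the variable names and offset value are placeholders rather than the values used in [2] (our own, simpler heuristic is listed in the Appendix).

# Hypothetical sketch of the heuristic in [2]: flap when the bird falls below the
# top of the lower pipe plus a hand-tuned offset. Names and the offset value are
# placeholders, not the values used in [2].
def heuristic_action(bird_y, lower_pipe_top_y, offset=0.05):
    if bird_y < lower_pipe_top_y + offset:
        return 1   # flap
    return 0       # no flap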

2.2 Model Predictive Control


[2] formulates an optimization problem and solves it over a prediction horizon. The MPC approach
only requires tuning the horizon N; this was done by simply increasing the value of N until the
controller was able to achieve a high score. The results for this approach were just under 500 average
points per game, which was the cap for the experiment, and the highest score on a full trial (no cap)
was 3,961 points. The most notable disadvantage of model predictive control is the computation
time, as well as the variation in computation time.

2.3 Machine Learning


[1] treat the problem as a Markov decision process (MDP), so that effective reinforcement learning
techniques can be applied. The takeaway from this approach is that RL is effective as a control
technique given enough training time, which is expensive up front, but the learned policy can be
reapplied quickly afterwards. The technique is robust to variations and is guaranteed to converge
given sufficient training time, which cannot be said of other control techniques.
[2] suggests that areas to potentially improve a reinforcement learning approach include
genetic algorithms or simulated annealing, which use a strategic collection of data through
controlled reduction and random sampling.

3 Problem Formulation
Flappy Bird provides novel, uncommon dynamics which are discrete-time and non-continuous, as shown
in equation 4. Furthermore, control only affects the bird's motion in one direction, upwards; no
control is available downwards. Rather, the bird must drift down under gravity. In addition to the
unique control and dynamics, obstacles (pipes) must be avoided. This provides an interesting test
case for different discrete-time control and planning methods.
This project aims to explore the trade-offs made when implementing different controllers. Previous
work has included open-loop receding-horizon controllers, such as MPC [2], as well as a form of
closed-loop optimal control, u_t^* = π^*(t, x_t), such as through a Support Vector Machine, a machine
learning method which provides a policy given any state input [1].

3.1 Flappy Dynamics


The general problem setup for this Flappy Bird implementation is to control a discrete non-linear
system

x_{k+1} = f(t, x_k, u_k)    (1)

where k ∈ ℕ is the time step, x_k ∈ ℝ² is the state, and u_k ∈ ℝ is the control input.
The state for the OpenAI Flappy Bird is output by the simulation, given below, and shown in
Figure 1 along with the game's positive coordinate directions, which are defined with respect to the bird:

x_k = [X_k, Y_k]    (2)

Here X_k is the horizontal distance from the center of the bird to the center of the incoming pipe,
and Y_k is the vertical distance from the center of the gap between the two pipes to the center of
the bird.
The discrete-time dynamics of Flappy Bird are known to be the following:

ẋ_k = [Ẋ, V_k]    (3)

where V_{k+1} is encoded as:

V_{k+1} = V_flap,                 if u_k = 1
V_{k+1} = min(V_max, V_k + g),    if u_k = 0    (4)

where u_k = 0 means no flap, u_k = 1 means flap, and V_flap and V_max are known constants (for a
glossary of terms see the Appendix). This can be written as a single-line equation:

V_{k+1} = V_flap u_k + (V_k + g)(1 − u_k)    (5)

Figure 1: Flappy Bird State
Equation 5 shows more plainly that the dynamics are nonlinear, since state and control variables
are multiplied. In this case linearization via a Taylor expansion or another approximation is not
applicable because the function is discontinuous.
Substituting equation 4 into the state from equation 2, the next state can be written as:

x_{k+1} = [X_k + Ẋ, Y_k + V_k]    (6)
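For concreteness, a minimal Python sketch of one step of these dynamics (equations 4 and 6) is given below. The constants follow the glossary in the Appendix, while X_DOT, the fixed horizontal change in the distance to the pipe per frame, is an assumed placeholder value rather than a value from the simulator.

# Minimal sketch of one step of the Flappy Bird dynamics, equations 4 and 6.
# Constants follow the glossary in the Appendix; X_DOT, the fixed horizontal change
# in the distance to the pipe per frame, is an assumed placeholder value.
V_FLAP = -9.0   # velocity applied on a flap (negative = upward)
V_MAX = 10.0    # maximum descent speed
G = 1.0         # downward acceleration per step
X_DOT = -4.0    # assumed horizontal step (the distance to the pipe shrinks each frame)

def step(x_k, y_k, v_k, u_k):
    """One discrete-time step: returns (x_{k+1}, y_{k+1}, v_{k+1})."""
    if u_k == 1:
        v_next = V_FLAP                  # flap: velocity is reset to V_flap
    else:
        v_next = min(V_MAX, v_k + G)     # glide: gravity, capped at the maximum descent speed
    x_next = x_k + X_DOT                 # horizontal distance updates at a fixed rate
    y_next = y_k + v_k                   # vertical offset updates with the current velocity
    return x_next, y_next, v_next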

3.1.1 Rotation
Note that the game animation also shows Flappy Bird rotating about its center, but this is aesthetic
only and does not affect Flappy Bird's motion or collisions, so it can be neglected.

4 Implemented Controllers
4.1 MPC Controller
We run our solver in a model predictive control configuration. Model predictive control rolls out a
trajectory as an optimization problem, in this case a mixed-integer linear program, and re-solves it
at every action step. This turns an open-loop trajectory into closed-loop control, at the expense
of needing to solve a potentially complicated problem in real time, which is not always feasible [3].
MPC tuning is simple compared to other approaches: the only parameters that require tuning
are the cost and the planning horizon used at each iteration. The following problem was used
to define our optimization.

minimize over Y, V, U:    |Y_N| + |V_N| + Σ_{k=0}^{N−1} ( |Y_k| + U_k )

subject to    Y_0 = Y_init
              V_0 = V_init
              Y_{k+1} = Y_k + V_k,                 ∀k ∈ {0, 1, ..., N−1}
              Y_k^lb ≤ Y_k ≤ Y_k^ub,               ∀k ∈ {0, 1, ..., N}          (7)
              −V_{k+1} + M U_k ≤ V_flap + M,       ∀k ∈ {0, 1, ..., N−1}
              V_{k+1} + M U_k ≤ −V_flap + M,       ∀k ∈ {0, 1, ..., N−1}
              −V_{k+1} + V_k − M U_k ≤ −A_fall,    ∀k ∈ {0, 1, ..., N−1}
              V_{k+1} − V_k − M U_k ≤ A_fall,      ∀k ∈ {0, 1, ..., N−1}

The system has piecewise affine dynamics, with an acceleration update if no control is applied
and an instantaneous update to the maximum upward velocity if a control is applied, as given in
equation 4.
X has a fixed update and all objects update at the same rate, so it is simpler to exclude X and
refer to the update in time (t) or iteration (k) directly when talking about the progression forward
through the course of pipes.
Boundary constraints are applied for obstacle avoidance: linear constraints bound Y_k from above
and below by the sky and ground if there is no pipe, and by the upper pipe height and lower pipe
height if there is a pipe at that time step.
Because the controls are limited to either 1 or 0, the problem must be modeled using mixed integer
programming (MIP), a methodology that allows the specification of convex (usually linear)
optimization problems that include integer/boolean variables. [3] pursues this approach using the
Big-M method to handle the constraints around the binary controls. The Big-M method introduces a
large constant M so that a constraint is enforced when the associated binary variable takes one value
and is effectively relaxed (trivially satisfied) when it takes the other, which lets the piecewise
dynamics in equation 4 be expressed with linear constraints. We used Gurobi to solve this problem,
since it is known for its performance in solving mixed integer optimization problems and is the
standard solver used in most other papers solving MIPs.
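To illustrate the formulation, below is a minimal gurobipy sketch of the big-M program in equation 7. It is a sketch rather than our exact implementation: it uses the sign convention of section 3.1 (positive velocity downward, so a flap pins the velocity to V_flap), and the big-M constant, fall acceleration, and per-step bounds y_lb/y_ub are assumed inputs.

# Minimal gurobipy sketch of the big-M program in equation 7 (not our exact code).
# N, M_BIG, V_FLAP, A_FALL and the per-step bounds y_lb/y_ub are assumed inputs.
import gurobipy as gp
from gurobipy import GRB

def solve_flappy_mpc(y_init, v_init, y_lb, y_ub, N=45, V_FLAP=-9.0, A_FALL=1.0, M_BIG=100.0):
    m = gp.Model("flappy_mpc")
    m.Params.OutputFlag = 0
    y = m.addVars(N + 1, lb=-GRB.INFINITY, name="Y")   # vertical offset to the pipe gap
    v = m.addVars(N + 1, lb=-GRB.INFINITY, name="V")   # vertical velocity (positive = down)
    u = m.addVars(N, vtype=GRB.BINARY, name="U")       # flap decisions
    ay = m.addVars(N + 1, lb=0.0, name="absY")         # epigraph variables for |Y_k|
    av = m.addVar(lb=0.0, name="absVN")                # epigraph variable for |V_N|

    m.addConstr(y[0] == y_init)
    m.addConstr(v[0] == v_init)
    for k in range(N + 1):
        m.addConstr(y[k] >= y_lb[k])                   # ground / lower pipe bound
        m.addConstr(y[k] <= y_ub[k])                   # sky / upper pipe bound
        m.addConstr(ay[k] >= y[k])
        m.addConstr(ay[k] >= -y[k])
    m.addConstr(av >= v[N])
    m.addConstr(av >= -v[N])
    for k in range(N):
        m.addConstr(y[k + 1] == y[k] + v[k])           # position update
        # Big-M pair: when u[k] = 1 these pin v[k+1] to the flap velocity,
        # when u[k] = 0 they are relaxed by the large constant M_BIG.
        m.addConstr(v[k + 1] <= V_FLAP + M_BIG * (1 - u[k]))
        m.addConstr(v[k + 1] >= V_FLAP - M_BIG * (1 - u[k]))
        # Big-M pair: when u[k] = 0 these enforce the gravity update,
        # when u[k] = 1 they are relaxed.
        m.addConstr(v[k + 1] <= v[k] + A_FALL + M_BIG * u[k])
        m.addConstr(v[k + 1] >= v[k] + A_FALL - M_BIG * u[k])

    m.setObjective(ay[N] + av + gp.quicksum(ay[k] + u[k] for k in range(N)), GRB.MINIMIZE)
    m.optimize()                                       # assumes the problem is feasible
    return int(round(u[0].X))                          # apply only the first flap decision

In the receding-horizon loop this problem is re-solved at every frame with the current state and updated bounds, and only the first flap decision is applied.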
Since the solutions are bounded by the obstacles, the controller cannot return a solution that results
in a collision. However, if the controller cannot see future obstacles, it may navigate into regions
that become infeasible as new information enters the prediction horizon. The problem therefore
becomes determining a prediction horizon that is long enough for persistent feasibility but also
minimizes computation time. The work by [2] shows that the prediction horizon necessary to prevent
collisions must include two pipes: 80 time steps for the space between pipes plus an additional 10
time steps to allow the bird to position itself on a feasible route between the two pipes. Translating
this into the time steps used in our model, where the pipes are 40 time steps apart, a horizon of 45
is used.
The goal of the rewards shaping is to minimize the optimization time by reducing the number
of branches required by the branch-and-bound technique used by Gurobi. [3] explores this by
reducing the range of the problem constraints, but found that this has little effect on the
optimization time. To extend this into examining the rewards, we attempt to find a combination of
possible rewards that minimizes the optimization time.

Table 1: Rewards Shaping

Rewards                 Avg Opt Time (sec)   Max Opt Time (sec)   Persistently Feasible
U                       0.6872               1.1728               No
V                       0.7435               2.8360               Yes
V                       1.5290               6.1700               No
Y + abs(V)              0.5168               0.7820               Yes
Y + U                   0.4686               0.9403               Yes
Y + U, Y_N + abs(V_N)   0.4666               0.6185               Yes
The table shows the various approaches, starting with simple trials using independent rewards.
The independent rewards and their justifications are: control, to minimize control effort;
magnitude of velocity, to promote stability (i.e., not flapping up too fast towards infeasible regions);
and bird position relative to the obstacles, to promote a neutral position between the boundaries. The
results show that the best combination of rewards is to use control and relative position at each
stage, with a terminal reward on the magnitude of velocity and relative position.

4.2 Reinforcement Learning


For a model-free learning-based controller, we implemented a Deep Q-Network (DQN) algorithm.
Since the state space we considered is continuous, using tabular methods was impractical. The
DQN was introduced by Mnih et al. [4], and it combines the Q-Learning algorithm with deep neural
networks (DNNs).

4.2.1 Q-Learning Equations & Algorithm


In reinforcement learning we incrementally update the action-value estimate Q_θ(s, a). In Deep
Q-Learning that parameterized approximation is a neural network. The update is derived from
the Bellman expectation equation, rewritten in terms of expectations over sampled rewards and
next states rather than explicit transition and reward functions:

Q_{i+1}(s, a) = E_{s′∼ϵ} [ r + γ max_{a′} Q_i(s′, a′) | s, a ]    (8)

where s′ is the next state, r is the reward, ϵ is the environment, and Q_i(s, a) is the Q-network
at the i-th iteration [5].
We want to minimize the loss between the approximation and the optimal action-value function,
choosing at each iteration a stable target y_i [6]:

L_i(θ_i) = (1/2) E_{s′∼ϵ} [ (y_i − Q(s, a; θ_i))² ]    (9)

To encourage stability, we maintain a main and a target neural network. Since the parametric
representation is the set of weights of the neural network, it is differentiable, and the gradient
descent step for the parameters using the gradient of the loss function is [4][6]:

Listing 1 Deep Q-Learning with Experience Replay
Initialize replay memory D to capacity N
Initialize main action-value network Q with random weights θ and target network weights ϕ = θ
for episode in range(1, M) do
    Initialize the game emulator
    while not DONE do
        With probability ϵ select the heuristic action a_k
        otherwise select a_k = argmax_a Q(s_k, a; θ)
        Execute action a_k in the emulator and observe reward r_k, state s_{k+1}, and whether DONE
        Store transition (s_k, a_k, r_k, s_{k+1}) in D
        Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        Set target y_j = r_j if terminal, else y_j = r_j + γ max_{a′} Q(s_{j+1}, a′; ϕ)
        Perform a gradient descent step on the main NN using target y_j
        if C updates to the main NN since the last update to the target NN then
            Update the target NN weights with the main NN weights: ϕ ← θ
        end if
        Set s_k ← s_{k+1}
    end while
end for

θ_{i,k+1} = θ_{i,k} + α ( r + γ max_{a′} Q(s′, a′; ϕ_i) − Q̂(s, a; θ_{i,k}) ) ∇_θ Q̂(s, a; θ_{i,k})    (10)

We therefore compute Q-learning targets with respect to fixed target parameters ϕ at each time
step k within an episode i, and after every C updates the target NN weights ϕ are overwritten with
the main NN weights θ. In addition, experience replay, in which transitions are stored in a memory
buffer and sampled in random minibatches, decorrelates the training data. We therefore apply the
update in eq. 10 to minibatches drawn from a replay memory collected under the learning policy.
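To make the update concrete, below is a minimal PyTorch sketch of one DQN gradient step on a replay minibatch, following equations 9 and 10. It is an illustrative sketch rather than our exact training code; the network sizes, discount factor, and learning rate are taken from section 4.2.2.

# Minimal PyTorch sketch of one DQN update on a replay minibatch (equations 9 and 10).
# Illustrative only; sizes and settings follow section 4.2.2 (state [X, Y, V],
# two hidden layers of 32 neurons, Adam, discount factor 0.99).
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 3, 2, 0.99

def make_net():
    return nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                         nn.Linear(32, 32), nn.ReLU(),
                         nn.Linear(32, n_actions))

q_net, target_net = make_net(), make_net()              # main (theta) and target (phi) networks
target_net.load_state_dict(q_net.state_dict())          # phi <- theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=10e-4)

def dqn_update(s, a, r, s_next, done):
    """One gradient step. s, s_next: (B, 3) float tensors; a, r, done: (B,) tensors."""
    with torch.no_grad():                                # targets use the fixed parameters phi
        max_next_q = target_net(s_next).max(dim=1).values
        y = r + gamma * max_next_q * (1.0 - done)        # target is just r at terminal states
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = 0.5 * (y - q_sa).pow(2).mean()                # loss from equation 9
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # gradient step on theta, equation 10
    return loss.item()

During training this update would be applied once per environment step to a freshly sampled minibatch, with target_net refreshed from q_net every C updates.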

4.2.2 Settings & Inputs


In our implementation two DNNs, a main neural network and a target neural network, are
used to approximate the non-linear Q-function. Each is two layers deep, with 32 neurons
per layer and ReLU activation functions. The output is two values, each corresponding to
the Q-value of one of the two actions, i.e. flap or no-flap.

• State space: The state space for the Q-Learning algorithm uses the same parameters as in
section 3.1 but also includes the vertical velocity of the bird. The bird's downward velocity is
a function of acceleration over time, and thus cannot be known from X_k, Y_k alone. By adding
V_k to the state space, the neural net can encode the transition to the next height state Y_{k+1}
deterministically rather than probabilistically. Papers that use pixel data have to include a
history of several frames in their state space to achieve this [5].

s_k = [X_k, Y_k, V_k]ᵀ

• Rewards: The agent gets a reward of one for every time step it does not hit a pipe and gets a
negative reward (a hyperparameter) if it hits a pipe.

• Discount factor: A discount factor of 0.99 was used so that the agent is not myopic and gives
weight to rewards far in the future.

• Learning rate: A learning rate of 10e-4 was used with the Adam optimizer to minimize the
mean squared temporal-difference error and train the neural network.
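As an illustration, the ϵ-heuristic action selection from Listing 1, combined with the warm-start heuristic listed in the Appendix, might look like the following sketch. The function and variable names are illustrative, and q_net is the main Q-network (for example, as constructed in the sketch in section 4.2.1).

# Illustrative sketch of the action selection in Listing 1 with the warm-start heuristic.
# q_net is the main Q-network and state is s_k = [X_k, Y_k, V_k].
import random
import torch

def select_action(q_net, state, epsilon):
    if random.random() < epsilon:
        # Heuristic from the Appendix: flap when the vertical offset to the
        # pipe gap drops below the hand-tuned threshold.
        return 1 if state[1] < -0.05 else 0
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item())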

5 Results
5.0.1 Reinforcement Learning
For the learning-based controller, we trained several neural networks with different combinations
of reward functions and state spaces. Two different reward scenarios were used. The first used a
positive reward of 1 for every time step the bird does not crash and a negative reward of -3 when
the bird crashes. The second added a positive reward for staying close to the centre of the
approaching pipe.
Two different state spaces were used. The first replicated what was described previously: the
horizontal and vertical distances to the centre of the approaching pipe and the vertical velocity of
the bird. The second also included the horizontal and vertical distances from the bird to the second
approaching pipe, as a means to encode longer-horizon information.
Warm starting was done by using the simple heuristic controller (see Appendix) and then
switching to random actions with some probability during training to help the bird explore new
state and action combinations.
Among all strategies, the best DQN controller was the one warm started with the simple
heuristic controller and no random actions, with the simple state space of just the horizontal and
vertical distances to the approaching pipe, and with only a positive reward of 1 for staying alive at
every time step. It was able to reach a maximum score of 191 pipes.

5.0.2 Comparison
Initial results included implementation of the baseline control scheme, which was also used as the
heuristic for RL. The controller passed 216 pipes before collision in testing and proved a tough
baseline. In comparing which controller works best for Flappy Bird, we review several criteria,
including game score, computation time, and implementation ease (w.r.t. hyperparameter tuning,
formulation, etc.), to determine the best overall performer.
Looking at the raw scores in Table 2, the heuristic controller performed better and more consistently
than DQN, though the learned network was able to achieve similar results. MPC was tested only to a
score of 100 due to the processing time required to run the controller during the game. It was the
most consistent controller, and for this reason we consider it the most reliable and the most likely
to reach the highest score.
Therefore, in addition to score, the planning time required during the run makes MPC less feasible
as an online controller. While this could be improved with different computing hardware, embedded
systems offer less leeway. DQN took more time offline but ran quickly in real time.

Controller Max Score
Heuristic 216
DQN 191
MPC 100+

Table 2: Implementation Summary

A possibility would be to use the MPC controller to warm start or perform behavior cloning
with a DQN. The solution found offline could then be run in real time on hardware with the neural
network.
As mentioned in the related work (section 2), previous attempts with Convolutional Neural Networks
using pixel data achieved high scores of over one thousand after training. There are a few reasons this
can be hard to achieve:

• Tuning hyperparameters can be difficult. For example, the appropriate learning rate can depend on
the reward scale.

• Backpropagation can assign low values to states that are encountered often but are not intrinsically
disadvantageous. For example, Ebeling-Rump et al. [7] note that states near the opening in the pipe
can initially be assigned low values because the bird often crashes in those states. However, avoiding
being near the opening of the pipe is not a successful behavior.

• Q-learning can get stuck in a sub-optimal region. Given our continuous state space, it may
be more difficult to explore. One solution could be to discretize or otherwise reduce the state
space.

• Reward tuning can be difficult as well, and a strategy may be to shape rewards for example
using our heuristic behaviors.

• Further improvements for RL could include implementing an actor-critic network, implementing
a CNN with image information, model-based RL using the available dynamics, and further
reward shaping.

The best overall performing controller was the heuristic, given that its setup and hand-tuning were
by far the least time consuming. For the highest score with fast runtime performance, further RL
controllers or a combination of our DQN and MPC controllers may produce the best results.

6 Appendix
6.1 Code
All code can be found in our GitHub repository https://github.com/sto-pau/AA203Project

6.2 Consent to Share


We give consent to share our work with future students of AA203

6.3 Heuristic Controller

c = -0.05  # hand tuned
# first observation is horizontal distance to the pipe
if obs[1] < c:
    # action 0 means do nothing, 1 means flap
    action = 1
else:
    action = 0

6.4 Glossary of Terms


Glossary for terms and hard-coded constants:
V_max = 10: max velocity along Y, i.e. the maximum descent speed
V_min = -8: min velocity along Y, i.e. the maximum ascent speed (not used)
g = 1: the player's downward acceleration
V_flap = -9: the player's velocity on flapping
u: action, where 1 is flap and 0 is no flap

References
[1] Y. Shu, L. Sun, M. Yan, and Z. Zhu, “Obstacles Avoidance with Machine Learning Control
Methods in Flappy Birds Setting.”
[2] M. Piper, P. Bhounsule, and K. K. Castillo-Villar, “How to Beat Flappy Bird: A Mixed-Integer
Model Predictive Control Approach,” in Volume 2: Mechatronics; Estimation and Identifica-
tion; Uncertain Systems and Robustness; Path Planning and Motion Control; Tracking Control
Systems; Multi-Agent and Networked Systems; Manufacturing; Intelligent Transportation and
Vehicles; Sensors and Actuators; Diagnostics and Detection; Unmanned, Ground and Sur-
face Robotics; Motion and Vibration Control Applications, Tysons, Virginia, USA: American
Society of Mechanical Engineers, Oct. 2017, V002T07A003.
[3] philzook58, Flappy Bird as a Mixed Integer Program, Oct. 2019. [Online]. Available:
https://www.philipzucker.com/flappy-bird-as-a-mixed-integer-program/.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, et al., Playing atari with deep reinforcement learning,
2013. arXiv: 1312.5602 [cs.LG].

[5] K. Chen, “Deep Reinforcement Learning for Flappy Bird.”
[6] M. J. Kochenderfer, T. A. Wheeler, and K. H. Wray, Algorithms for decision making. Cam-
bridge, Massachusetts: The MIT Press, 2022.
[7] M. Ebeling-Rump, M. Kao, and Z. Hervieux-Moore, “Applying Q-Learning to Flappy Bird.”
