above 900.0, and no model has solved the game yet.

(a) OpenAI Gym CarRacing-v0 Webpage
(b) In-game Screenshot of CarRacing-v0
Figure 1. Our Target Simulated Environment: OpenAI Gym CarRacing-v0. Figure 1(a) is a screenshot of the main page of the OpenAI Gym CarRacing-v0 experimental environment. Figure 1(b) is a screenshot of the actual game being played by a person.

2.2. Related Work

There have been recent breakthroughs in convolutional neural networks for image recognition and computer vision tasks. Since the landmark work of AlexNet[7], one of the first works to tackle the ImageNet classification challenge with convolutional neural networks, there have been many modifications and improvements to the vanilla version of CNNs, such as VGGNet[16], SqueezeNet[4], and hierarchical CNN or HD-CNN[18]. Many studies focused on implementing deep but efficient and stable CNN architectures. However, one of the breakthroughs that made many vision tasks easier is the concept of transfer learning, which can improve learning performance by avoiding expensive data-labeling efforts[13]. There have been many applications of transfer learning from well-trained networks like VGGNet to various domains, including cancer detection, video classification, and writing recognition[15][5][1].

Aided by such remarkable achievements in CNN techniques during the last few years, scholars have developed unprecedented deep reinforcement learning algorithms as well. Since Mnih et al. redefined the era of deep RL[10], many people have solved many different games (e.g. Atari and OpenAI Gym) with different sorts of architectures[14]. Recently, there have been significant efforts to tackle environments with continuous settings, for example a continuous action space, because such games closely resemble the actual world[8][11][9][17].

However, while many deep RL algorithms receive raw pixels as inputs and therefore rely heavily on CNNs to process those image inputs, there has been no extensive exploration of how to find a proper CNN architecture, how to preprocess the inputs for CNNs, and whether pretrained CNNs can be applied to a target environment. Some works have rigorously experimented with the hyperparameters of CNNs only[2], and some works have benchmarked different types of deep RL algorithms[3]. Although Kawaguchi showed that deep learning needs to be carefully set up to get the expected performance[6], there has been no exploration of CNNs and state-of-the-art deep RL combined. Here, we extensively discuss how our novel modification of A3C can give a competitive score by carefully configuring CNNs.

3. Method

3.1. Convolutional Neural Network and Image preprocessing

As mentioned in the simulation environment section, the CarRacing-v0 environment represents each game state by a 96 × 96 × 3 RGB array. The bottom 12 × 96 × 3 pixel area (see Figure 1(b)) contains the car dashboard, which displays state information such as the velocity, acceleration, gyroscope and relative driving wheel angle of the car agent. It is important to note that this state information is only indirectly represented by the 96 × 96 × 3 game state array and is not explicitly accessible anywhere within the framework.

Since the state frame only contains raw RGB pixel data and we don't have exact knowledge of how each piece of state information is rendered on the bottom dashboard of the game state frame, we decided to use a Convolutional Neural Network as the feature extractor for our A3C model. This CNN module behaves as an add-on to the A3C model, extracting visual information from the given state frames and squashing it into an n-dimensional real vector. The A3C model, upon receiving the feature vector, computes the policy and expected return of the agent at the given state.

For a number of reasons further explained in the results section, instead of training the CNN feature extractor with raw RGB pixel data, we take a number of steps to preprocess the state frames.

First, we apply grey-scaling to the original raw pixel
frame to reduce the depth of the image from 3 to 1. Second, we subtract 127.0 from each pixel so that the values are zero-centered, ranging from -127 to 128. This makes the model more robust to the randomization of the filter weights. Third, as mentioned in the previous paragraph, we remove the dashboard by removing the 12 × 96 pixel area at the bottom of each frame. We additionally crop out 96 × 6 pixels from both the left and right ends of each frame so that each frame is an 84 × 84 square array. Since we cropped out the bottom dashboard area, we need a way to restore the lost velocity, acceleration and driving wheel position information to our game state. Therefore, we lastly concatenate 5 consecutive 84 × 84 preprocessed frames to construct a single state representation. This is the same method used by Mnih et al.[10] to extract spatial movement information in a number of the Atari game problems. The resulting state array has the shape of 84 × 84 × 5.

This array is then fed to our CNN feature extractor. As further explained in the results section, we experiment with a variety of CNN layers and hyperparameters. The criteria that we considered are 1) the performance of the CarRacing agent as defined by the environment, and 2) training time. Since many RL algorithms, especially A3C models, are known to take a long time to train and converge, we focus heavily not only on performance but also on training time. Deeper CNNs, as seen in recent trends, may be able to represent the state more accurately as feature vectors. However, with a large number of parameters and high computational complexity, it would be very hard for us to distinguish whether our model has already converged or is still being optimized. This is especially so in the CarRacing-v0 problem that we are dealing with, since it is computationally much heavier than other OpenAI Gym environments and hence takes longer to execute an action at each step.

Since training time is a very important criterion, our implementation of the CNN does not have max-pooling, batch normalization or dropout layers between the convolutional layers. Instead, we attempt to replace these layers with 1) zero-centering of the raw pixel values, 2) Xavier weight initialization with reduced scale, and 3) a relatively shallow network. Our results verify our hypothesis. As shown in Table 1, shallower and lighter CNN feature extractors showed better performance, with our 2-layer CNN model ultimately achieving fourth rank on the OpenAI Gym leaderboard.

3.2. Continuous Certainty

We introduce the concept of continuous certainty to the vanilla A3C model. It smooths out the discrete action space that the trained agent can choose from, so that the output of the policy network becomes completely continuous. While we simply cherry-picked five possible actions that the agent would take, this approach has a stark disadvantage: regardless of how well the network is trained, the optimal action our agent chooses is always one of those five actions. That is, even though the action space of the environment is entirely continuous, the agent ignores that important characteristic just because of the intrinsic architecture of the vanilla A3C model (see Figure 3 for a vanilla A3C example). As we will see in Section 5, DDPG, which is targeted at continuous action spaces, is not apt for our target environment due to our environment's complexity, while the vanilla A3C model with cherry-picked discrete actions gives us somewhat reasonable results. However, we suggest that the simple A3C model can still be improved by incorporating the continuous nature of the softmax probability when choosing the optimal policy. Simply taking the argmax of the softmax probability vector as the optimal action might be extreme when the softmax probabilities of the actions are close to each other, as we will discuss in the example below (Figures 4 and 5). Instead, by multiplying the softmax probability with the argmax action, we can smooth out the five discrete actions into a totally continuous space. Mathematically, incorporating continuous certainty does not harm the backprop procedure either, since doing so is merely a multiplication.

Figure 2. An example of optimal policy output in vanilla A3C model. In this contrived example, the softmax probability of the third action is the greatest, so the policy network chooses the third action as the optimal action to take from the given state.

Let us see when our suggestion of certainty multiplication is particularly helpful. Suppose an agent encounters a corner that gives a softmax probability of, for example, [0.19, 0.19, 0.24, 0.19, 0.19], as demonstrated in Figure 4. The simple A3C model will choose action 3, taking acceleration with a magnitude of 1.0, or full acceleration. This might work in some cases, but usually, taking full acceleration for consecutive time frames in this game is not a good strategy. If an agent encounters an unfamiliar, sharp corner after consecutive full accelerations, the agent, or even a well-trained human, cannot manipulate the car properly and will most likely deviate from the track, which will result in a seriously bad score at the end.

Now, let us observe what happens when an agent extracts more information from the softmax probability. As described in Figure 5, by multiplying 0.24 to [0.0, 1.0, 0.0],
the agent will take the action [0.0, 0.24, 0.0], and this will stabilize the learning as the agent lessens how much it accelerates. In other words, instead of taking full acceleration, because the agent is quite uncertain which action is definitely better than the others, it becomes more careful in taking the argmax action. In this particular example, the certainty of taking the argmax action ([0.0, 1.0, 0.0]) is only 0.24, so the agent decides to take an acceleration of 0.24. One can notice that this is how an actual novice human driver learns how to drive when he or she is driving in an area for the first time.

Figure 3. A bad scenario for optimal policy selection when we use the vanilla A3C model. The softmax probabilities of the actions are very close to each other, and simply taking an argmax will result in one extreme discrete value.

Figure 5. Our Final A3C Model Architecture. A stack of five preprocessed frames is input into the network. The front-end two-layered CNN extracts image features from the pixel frames and passes them into the policy network and value network. The networks output a 5 × 1 softmax vector and a 1-dimensional scalar value estimate, respectively. The argmax of the softmax vector from the policy network is then multiplied with the softmax value to give the optimal action the agent should take.

5. Results/Analysis

5.1. CNN Architectures

As mentioned in Section 2, the game agent does not know how each piece of state information (velocity, acceleration, driving wheel position) is computed and rendered on the bottom dashboard of the game state frame. We hypothesized that a carefully designed CNN would be able to extract visual state information from the RGB frames, providing the A3C RL agent sufficient data to infer critical information such as the curve angle, velocity, acceleration, driving wheel position, distance from the road center, etc. The CNN, upon extracting this information, would embed it in an n-dimensional state vector, with which the A3C network in turn computes the policy and the expected reward. Thus, the training process of the two networks (CNN and A3C) is joint rather than separate.
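
The continuous-certainty rule of Section 3.2 can be illustrated with a minimal sketch. It assumes each action is a vector [steering, gas, brake]; only the full-acceleration action [0.0, 1.0, 0.0] is taken from the text, while the other four rows of `ACTIONS` and both function names are hypothetical placeholders, not the exact set used in our implementation.

```python
import numpy as np

# Hypothetical discrete action set, each row being [steering, gas, brake].
# Only the third row (full acceleration, [0.0, 1.0, 0.0]) comes from the
# example in the text; the other four rows are placeholders.
ACTIONS = np.array([
    [-1.0, 0.0, 0.0],   # steer left (assumed)
    [1.0, 0.0, 0.0],    # steer right (assumed)
    [0.0, 1.0, 0.0],    # full acceleration (from the text)
    [0.0, 0.0, 1.0],    # brake (assumed)
    [0.0, 0.0, 0.0],    # no-op (assumed)
])


def vanilla_action(probs):
    """Vanilla A3C choice: the discrete action with the largest probability."""
    return ACTIONS[int(np.argmax(probs))]


def continuous_certainty_action(probs):
    """Scale the argmax action by its softmax probability (its certainty)."""
    probs = np.asarray(probs, dtype=np.float64)
    best = int(np.argmax(probs))
    return probs[best] * ACTIONS[best]


probs = [0.19, 0.19, 0.24, 0.19, 0.19]
print(vanilla_action(probs))               # full acceleration
print(continuous_certainty_action(probs))  # acceleration scaled down to 0.24
```

With the near-uniform probabilities of the example, the vanilla rule commits to full acceleration, whereas the scaled rule accelerates with magnitude 0.24 only.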
way to guarantee the correctness of the labels generated by automated scripts. Since the number of classes is a critical hyperparameter for image detection / classification CNNs, and since it is very hard to change once the network is trained, pretraining the CNN on the game environment is bound to be a very complex task. Lastly, since CNNs are designed to perform well on image classification / detection tasks, pretrained CNNs may not be well suited for an RL objective, which, in this case, is to maximize the reward by completing the course in a timely manner. Due to the difference in the objective, pretrained CNN modules may neglect crucial information that the A3C RL agent might need to perform better. For these reasons, we have deemed it unreasonable to use a pretrained network to tackle this problem.

5.1.2 Limitations of Raw RGB Pixel States

In the methods section, we noted that we made the design decision to preprocess the frames and stack 5 consecutive frames to construct a single state. Interestingly, we found that it is not easy to train the CNN feature extractor to show reasonable performance with just the RGB state frames. Upon close analysis of the evaluation videos recorded by the OpenAI Gym Monitoring feature, we noticed that our model trained with pure RGB state frames completely fails to learn to make curves or slow down. This led us to realize that the 96 × 96 × 3 state representation, unlike the 1024 × 1024 × 3 frame rendered for human players during interactive gameplay, was too crude to capture the subtle changes in velocity, acceleration, driving wheel position, etc. For example, the acceleration bar in the game state representation was only 2 to 3 pixels high on average, indicating that our CNN had access to only an extremely lossy representation of the original 1024 × 1024 × 3 state frame. Therefore we decided to change the architecture of our CNN to take 84 × 84 × 5 preprocessed state pixel arrays instead of 96 × 96 × 3 raw frames, as mentioned in Section 3.1.

To extend the discussion on RGB pixel states, after trying several schemes as in 5.1.3, we conjecture that Canny edge detection results in noisy, low-quality edges and that HOG features likewise lose important information about the tracks, so neither improves the pipeline and both actually degrade the performance. On the other hand, applying Laplacian edge detection improved the average performance by a small amount of about 20, but the standard deviation was more than twice that of our initial choice. Therefore, for evaluation purposes, we chose a simple grayscale, meanshift, and crop strategy for image preprocessing instead of the Laplacian edge detector in order to make our performance more consistent and stable during evaluation.

5.1.3 CNN Performance Analysis

We tested our model with multiple CNN architectures of varying depths, from 2 layers to 7 layers. With the deeper CNNs, we used a filter size of 3 and a stride of 2 for most layers so that the network still has a large enough receptive field to detect large critical visual components such as a large curve, a curve start indicator, etc. With shallower networks, we had to use a bigger filter size and stride (8 and 4 for the 2-layer model) for the same purpose.

Table 1 shows the performance comparison of our models in the CarRacing-v0 environment. Note that although we have tested more CNN architectures, only the models that showed reasonable performance are listed in the table. In Table 1, we can clearly see that the model with the shallowest CNN architecture performs the best in the given task. Deeper CNNs show converged performance of about 300, while our shallowest model with 2 layers and a wider filter size shows converged performance of 571 (https://gym.openai.com/evaluations/eval_IEdi97CIQeC7ZFKmM9L3dA). Moreover, a close inspection of the episodic results shows that the model achieves a very high score of over 700 in approximately 25 percent of all evaluation episodes. The reason that the mean score is 571.68 is that in three out of one hundred episodes, the car agent achieves a very low score close to zero. Although the evaluation video was not saved for these episodes, we were able to reproduce this behavior with the same model later. This was due to the randomness in the racing circuit generation of the CarRacing-v0 environment. In the cases where our car agent ends up with a very low score, we found that the frame contains a very sharp 160 to 180 degree turn at the beginning of the game, and the frame looks as if there are two tracks in the game that you can choose from. The car agent then gets confused about which road to take and gets stuck between the two roads in the grass zone, resulting in very low scores of around 0 to 20 points. The result supports our initial hypothesis that deeper CNNs would not perform better than the shallower ones. This is due to a number of reasons.

First, as we can see in Figure 1, the state frames returned by the CarRacing-v0 environment are not complex enough to require a deep CNN architecture. Tracks are colored in grey, grass in green, and the car in red. The shape of a frame is only 84 by 84 (after preprocessing), even smaller than Atari Pong, which has a state size of 210 by 160 by 3. The game frame is much smaller than the average image size of the ImageNet examples. Moreover, each frame typically contains only about 5 different colors, with very simple shapes such as straight lines, square patches of grass, etc. This means that we do not require a complex, deep CNN architecture to tackle this problem.

Second, deeper and more complex CNN architectures are harder to train. Since our training objective is not the
classification softmax scores, the backpropagation phase is conducted not after an image and a label are shown, but after the agent receives a reward for executing an action. This means that a deep CNN model that is known to achieve super-human performance in image classification tasks may not be the best model for our RL objective. It would be very hard to train the large number of parameters that come with the deeper models, and it is very likely that the model would fall into a local minimum at an early stage of training.

Third, deeper models take longer to train. Deeper convolutional networks inherently have a larger number of parameters and hence require a larger set of training examples and more iterations to converge. This problem becomes more evident with our A3C model, since the model is known to take a lot of time to train. For example, even our model with the simplest CNN architecture (CNN Model 4 of Table 1) took 1 million iterations to converge. Since the A3C model utilizes multicore CPUs rather than GPUs, the training time increases significantly with the number of parameters. Moreover, we have found that it is very hard to tell whether an A3C model for this environment has converged or not. As we have described in our CS234 paper, the model improves performance after sudden bursts in the losses, and after a long period of extremely low rewards. We have not yet found the right way to decide whether the model has converged. In our implementations, we deemed the model converged if the average reward does not increase by 10 points for 200,000 iterations. However, this may not be the right way to determine the convergence of the A3C model with a CNN. This means that there is a chance that the models with deeper CNNs could have converged to parameters with higher performance. But it still does not change the fact that the deeper models took significantly more time to reach a certain performance, to the point that we think is quite unreasonable (over 2 days on Google Cloud 8-core CPU machines).

CNN Model   Layers                             2 Threads         4 Threads
#1          L1-5 (5 layers): F=16, W=3, S=2    187.03 ± 44.54    169.25 ± 41.87
#2          L1: F=16, W=8, S=3                 182.42 ± 31.71    198.23 ± 36.96
            L2: F=32, W=5, S=2
            L3: F=32, W=4, S=2
            L4: F=16, W=3, S=2
#3          L1: F=16, W=8, S=3                 391.26 ± 22.62    370.02 ± 32.14
            L2: F=32, W=3, S=2
            L3: F=32, W=2, S=1
#4          L1: F=16, W=8, S=4                 571.68 ± 19.38    481.65 ± 17.91
            L2: F=32, W=3, S=2

Table 1. Effects of CNN architecture and the number of threads on the overall performance of the pipeline. The best performance could be observed from the two-layered CNN architecture with 2 threads and is highlighted in the table.

                     Grayscale, meanshift,   Canny Edge   Laplacian Edge   HOG
                     and Crop                Detector     Detector         Features
Mean score           571.68                  430.15       590.90           390.28
Standard deviation   19.38                   36.71        45.01            29.23

Table 2. Effects of different image preprocessing strategies on the overall performance of the entire pipeline. The best performance could be observed when we apply the Laplacian edge detector for image preprocessing, but with a high standard deviation.

to interpret, but we assume that it captures more fine details of the large patches captured by the first layer.
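
As a rough check on how small the feature maps of the best model in Table 1 are, the following sketch computes the spatial size after each convolution. It reads F, W and S as number of filters, filter width and stride, and assumes square filters with no padding; the padding scheme is not stated in the text, so the exact sizes are an assumption, and the helper names are hypothetical.

```python
def conv_output_size(size, width, stride):
    """Spatial size after one convolution with no padding (assumed)."""
    return (size - width) // stride + 1


def feature_map_shapes(input_size, layers):
    """Return (height, width, channels) after each (F, W, S) layer."""
    shapes = []
    size = input_size
    for filters, width, stride in layers:
        size = conv_output_size(size, width, stride)
        shapes.append((size, size, filters))
    return shapes


# CNN Model #4 from Table 1: L1: F=16, W=8, S=4; L2: F=32, W=3, S=2.
MODEL_4 = [(16, 8, 4), (32, 3, 2)]

shapes = feature_map_shapes(84, MODEL_4)  # 84 x 84 preprocessed input
flat_features = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)         # [(20, 20, 16), (9, 9, 32)]
print(flat_features)  # 2592
```

Under these assumptions the two-layer model reduces the 84 × 84 × 5 state to a 9 × 9 × 32 map, i.e. 2592 values before the policy and value heads.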
Figure 8. 23rd filter of the second convolution layer.
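
The grayscale, meanshift, and crop strategy of Section 3.1, which feeds the convolution layers visualized here, can be sketched as follows. The grayscale conversion by channel mean, the repetition of the first frame at the start of an episode, and the names `preprocess_frame` and `FrameStack` are assumptions made for illustration; the text only fixes the 127.0 shift, the crop to 84 × 84, and the stack of 5 frames.

```python
from collections import deque

import numpy as np

STACK_SIZE = 5  # number of consecutive frames per state, as in the text


def preprocess_frame(frame):
    """Turn one 96x96x3 RGB frame into an 84x84 zero-centered grayscale array."""
    gray = frame.astype(np.float32).mean(axis=2)  # grayscale by channel mean (assumed)
    gray -= 127.0                                 # zero-center: 0..255 -> -127..128
    return gray[:84, 6:90]                        # drop bottom 12 rows, 6 columns per side


class FrameStack:
    """Keeps the last STACK_SIZE preprocessed frames as one 84x84x5 state."""

    def __init__(self, size=STACK_SIZE):
        self.size = size
        self.frames = deque(maxlen=size)

    def push(self, frame):
        processed = preprocess_frame(frame)
        if not self.frames:
            # Assumed start-of-episode behavior: repeat the first frame.
            self.frames.extend([processed] * self.size)
        else:
            self.frames.append(processed)
        return np.stack(list(self.frames), axis=-1)


stack = FrameStack()
state = stack.push(np.zeros((96, 96, 3), dtype=np.uint8))
print(state.shape)  # (84, 84, 5)
```

Each call to `push` with a new environment frame returns the state array that would be passed to the CNN feature extractor.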
trained for more than 4,000,000 iterations, we noticed that different models at different checkpoints tended to be good at one task but not as good at another. For example, some overfitted models performed better on the straight portions, while some more generalizable models did well in cornering but not as well as the overfitted ones on the straight lanes (they would oscillate left and right on a straight line, which is an obstacle to getting a higher score). Hence, we would like to try an ensemble of models that are particularly good at straight lanes and models that are good at cornering. We could also try some other image processing to boost performance.

Last but not least, we would like to explore the performance of other models that we did not have enough time to try. For instance, we suspect that A3C with LSTMs can enhance the performance significantly. It is true that our current model attempts to capture a window of frames to reflect the recent history for the next move, but an LSTM is a more state-of-the-art, reliable, and explicit way for the agent to learn how to determine its next action from a set of past actions. We would also love to see the performance of simple policy gradients and other modifications of DDQN if possible.

7. Github Repositories

A3C: https://github.com/sjang92/car racing
DDPG1: https://github.com/jessemin/racing ddpg
DDPG2: https://github.com/jessemin/car racing
DDQN: https://github.com/jakekim1009/hw2 for racing

We implemented our DDQN agent based on the code from CS234 Assignment2.

References

[1] D. C. Ciresan, U. Meier, and J. Schmidhuber. Transfer learning for Latin and Chinese characters with deep neural networks. In Neural Networks (IJCNN), The 2012 International Joint Conference on, pages 1-6. IEEE, 2012.
[2] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, pages 3460-3468, 2015.
[3] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
[4] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[5] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725-1732, 2014.
[6] K. Kawaguchi. Deep learning without poor local minima. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 586-594. Curran Associates, Inc., 2016.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[9] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of International Conference on Machine Learning, 2016.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
[11] G. Neumann. The reinforcement learning toolbox, reinforcement learning for optimal control tasks. na, 2005.
[12] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863-2871, 2015.
[13] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.
[14] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008.
[15] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5):1285-1298, 2016.
[16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[17] K. G. Vamvoudakis and F. L. Lewis. Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46(5):878-888, 2010.
[18] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, and Y. Yu. HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2740-2748, 2015.