
Show, Attend and Play using Deep Reinforcement Learning

John Doe

Proposal
A large body of work applies Reinforcement Learning (RL) to image-based environments. A typical
example is learning the best policies for Atari games. Most of the time, the policy search takes
the whole frame as its input. Two key points are worth noting here:

1. Not every pixel of a frame is important for understanding the environment.

2. We can get an idea of the environment by attending to specific “important” areas of the image.

If we can get the model to focus on the relevant parts of an image, policy search can converge
much faster. A recent work addresses this issue by segmenting objects in the video frames of an
Atari game (Goel, Weng and Poupart). Mainly, it:

• Uses unsupervised video object segmentation to segment the moving objects in the frames.
This is helpful because, in most cases, moving objects are among the most important
aspects of the environment.
• Combines the moving-object segmentation map with the input image features, which are
then taken as input to predict the policy and state values (refer to the figure below).

We can divide a given frame into two parts:

A. Moving objects
B. Static objects

Our proposal is to improve the static object detection network by introducing an attention model.
Attention models (Xu, Kelvin, et al.; Xu, Tao, et al.) have gained much popularity in computer
vision because they let the network focus on a specific part of the image.
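
To make the idea of attending to part of an image concrete, here is a minimal sketch of soft attention over a handful of region features. It is only an illustration under stated assumptions: the names `soft_attention`, `regions` and `query` are hypothetical, and the dot-product scoring is a simplification chosen for brevity rather than the scoring network used in the cited attention papers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(region_features, query):
    """Blend region features according to their relevance to a query.

    region_features: list of feature vectors, one per image region.
    query: vector of the same dimension (assumed here to summarise
           what the agent currently cares about).
    Returns (context, weights), where the weights sum to 1.
    """
    # Relevance score of each region: dot product with the query (an assumption).
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(query)
    # Context vector: attention-weighted average of the region features.
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return context, weights


# Toy frame with four regions; only the third one carries a strong signal.
regions = [[0.0, 0.0], [0.1, 0.0], [2.0, 1.0], [0.0, 0.1]]
query = [1.0, 1.0]
context, weights = soft_attention(regions, query)
```

In this toy example the third region receives most of the attention weight, so the context vector passed on to the policy is dominated by that region instead of being spread evenly over all regions.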

Here, we will apply attention to video frames to obtain better and faster object detection, and we
hope that this will make learning much faster. Our proposed model will look like

We will mostly use Actor-Critic based policy learning and value optimization, in which a policy
(the actor) is learned together with a value estimate (the critic).
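
As a rough illustration of what an Actor-Critic update involves, the sketch below performs a single one-step update on a tabular toy problem. It is a simplified stand-in, not the proposed model: the function name `actor_critic_step`, the tabular `prefs` and `values`, and the learning rates are all assumptions made for the example, whereas the proposal itself would use a neural network on image features.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(prefs, values, state, action, reward, next_state, done,
                      gamma=0.99, actor_lr=0.1, critic_lr=0.1):
    """Apply one one-step actor-critic update in place and return the TD error.

    prefs[state]  : list of action preferences (the actor, softmax policy).
    values[state] : state-value estimate (the critic).
    """
    # Critic: TD error against the bootstrapped one-step target.
    target = reward + (0.0 if done else gamma * values[next_state])
    td_error = target - values[state]
    values[state] += critic_lr * td_error

    # Actor: policy-gradient step, using the TD error as the advantage.
    probs = softmax(prefs[state])
    for a in range(len(probs)):
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        prefs[state][a] += actor_lr * td_error * grad_log_pi
    return td_error


# Toy problem: one state, two actions; action 1 was taken and rewarded.
prefs = [[0.0, 0.0]]
values = [0.0]
td = actor_critic_step(prefs, values, state=0, action=1,
                       reward=1.0, next_state=0, done=True)
```

After this single update the critic's estimate for the state rises and the actor's preference shifts towards the rewarded action, which is the basic mechanism that both the baseline and the attention-based variant would rely on.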

Evaluation

Finally, we will compare the results of the baselines and previous work with those of this model.

References

Goel, Vikash, Jameson Weng and Pascal Poupart. "Unsupervised video object segmentation for deep
reinforcement learning." Advances in Neural Information Processing Systems. 2018.

Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention."
International Conference on Machine Learning. 2015.

Xu, Tao, et al. "AttnGAN: Fine-grained text to image generation with attentional generative adversarial
networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
2018.
