
One Pager on: Show me a story: Towards Coherent Neural Story Illustration

Hareesh Ravi, Lezi Wang, Carlos M. Muniz, Leonid Sigal, Dimitris N. Metaxas,
Mubbasir Kapadia

Student Name: Shawkh Ibne Rashid

This paper proposes a method for the inverse of the image-captioning problem: retrieving a
correlated sequence of images for a paragraph of text. The authors refer to the input as a
story-in-sequence (SIS) and the output of their model as images-in-sequence (IIS). To map an
SIS to images, they propose an end-to-end neural architecture in the form of an encoder-decoder:
the model encodes the sentences into feature representations and decodes these into a
correlated set of images. To keep the objects and persons that recur across sentences
consistent across the retrieved images, they use a coherence vector. The model consists of a
two-stage GRU-RNN network along with a VGG-16 CNN. The first GRU-RNN encodes every word of a
sentence to form a feature vector, so a paragraph of n sentences yields n feature vectors. The
second stage introduces sequential context into these vectors. The corresponding image feature
vectors are obtained from a pre-trained VGG-16 model. The whole network is trained with an
order-embedding loss, which constrains story feature vectors to lie as close as possible to
their corresponding image feature vectors.
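The order-embedding idea can be sketched as follows. This is a minimal, illustrative
formulation: the function names, the margin value, and the contrastive mining over in-batch
negatives are my own assumptions, not the paper's exact hyper-parameters or code.

```python
def order_violation(story_vec, image_vec):
    # Order-embedding penalty: zero when the image embedding dominates the
    # story embedding coordinate-wise, positive otherwise.
    return sum(max(0.0, s - i) ** 2 for s, i in zip(story_vec, image_vec))


def contrastive_order_loss(stories, images, margin=0.05):
    """Margin-based contrastive loss over a batch of embeddings.

    stories, images: lists of equal-length vectors, where images[k] is the
    ground-truth match for stories[k]. Mismatched pairs serve as negatives.
    """
    n = len(stories)
    loss = 0.0
    for k in range(n):
        pos = order_violation(stories[k], images[k])
        for j in range(n):
            if j == k:
                continue
            # Hinge terms: the true pair should violate the order less than
            # any mismatched pair, by at least `margin`, in both directions.
            loss += max(0.0, margin + pos - order_violation(stories[k], images[j]))
            loss += max(0.0, margin + pos - order_violation(stories[j], images[k]))
    return loss / n
```

In this scheme a matched story/image pair incurs no penalty once the image embedding
dominates the story embedding, while hard negatives that also satisfy the order keep
contributing through the margin.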

The authors conducted a user study with AMT workers to compare the results of their proposed
model against two other models (a Baseline Network and a Network without Coherence) and
against the ground-truth images on the VIST dataset. The AMT workers preferred the results of
the proposed model over the other models and, in some cases, even over the ground truth. The
authors also proposed a visual-saliency-based metric to measure coherence among the output
images: the goal is to check whether the objects and people that recur in the paragraph are
also maintained across the images. As the number of considered images increases, the Baseline
Network performs better than the proposed network on this metric, but in terms of consistency
the proposed model outperforms the others.
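The intuition behind such a coherence metric can be sketched with a toy score. This
simplified formulation is my own, not the paper's actual saliency-based metric: it assumes
each image has already been reduced to a set of detected salient-object labels, and it
measures how consistently the entities mentioned in multiple sentences reappear.

```python
def sequence_coherence(image_objects, recurring_entities):
    """Toy coherence score for an illustrated story.

    image_objects: list of sets of object labels detected in each image.
    recurring_entities: entities mentioned in more than one sentence of
    the story (these are the ones coherence cares about).
    Returns the fraction of (entity, image) opportunities realised.
    """
    if not recurring_entities or not image_objects:
        return 0.0
    hits = sum(1 for objs in image_objects
               for e in recurring_entities if e in objs)
    return hits / (len(image_objects) * len(recurring_entities))
```

For example, if "dog" recurs in the story but appears in only two of three retrieved
images, the score is 2/3; a sequence that drops a recurring entity entirely scores lower.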

The authors pointed out the need for a better evaluation metric for this task and noted it as
future work. VIST was also the only dataset the authors could find with sequences of images
paired with corresponding text descriptions, and some of its paragraphs lack well-defined
coherence, so a more comprehensive dataset could be produced. Little prior work addresses this
paragraph-to-image-sequence problem, so there is considerable scope for improvement: different
RNN and CNN architectures, along with parameter tuning, could be tried to increase the model's
performance.
