
Image Caption Generator using Transformers
Abstract
In the realm of Computer Vision and Natural Language Processing, the fusion of Convolutional Neural Networks (CNNs) and Transformers represents a pivotal research area. An image caption generator aims to generate textual captions for images automatically, using a CNN for efficient feature extraction from images and a Transformer encoder-decoder architecture for caption generation. The Transformer architecture is known for its capability to capture long-range dependencies in text data, making it suitable for generating informative and content-aware image captions. The primary objectives involve investigating the model's efficiency, exploring novel techniques to improve image understanding, and evaluating its performance against existing methods. Our key message is the promise of more accurate and semantically meaningful image captions, with implications for accessibility, image indexing, and human-computer interaction.
Keywords - Image Processing, Encoder-Decoder, Transformers

Introduction
Image caption generation is a challenging task in the domain of computer vision and natural language processing.
It involves the generation of descriptive and coherent natural language sentences that accurately describe the
content of an image. This capability has numerous applications, including aiding visually impaired individuals,
enhancing content retrieval systems, and improving user experiences on social media platforms.

Traditional methods for image captioning often relied on combinations of handcrafted features, such as color
histograms and edge detectors, and sequence-to-sequence models. However, these methods had limitations in
terms of generating captions that are both contextually relevant and linguistically coherent.

Recent advancements in deep learning and natural language processing have led to the development of
Transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers).
Transformers have shown remarkable success in a wide range of NLP tasks, including machine translation, text
generation, and text classification. Image captioning is an application of deep learning that combines computer vision and natural language processing (NLP) to generate descriptive textual captions for images.
This project aims to leverage the power of Transformer models to create an effective image caption generator.
We propose a novel approach that combines image features extracted from Convolutional Neural Networks
(CNNs) with a pre-trained BERT model to generate natural language captions for images. The model is trained on
a large dataset of images with corresponding captions, allowing it to learn to generate coherent and contextually
relevant descriptions of the images.

In this paper, we will present the details of our approach, including data preprocessing, model architecture, and
training methodology. We will also provide an extensive literature review to contextualize our work within the
existing research landscape.
Fig. 1. Window sliding across the image and generating captions

The task of Image Captioning requires the integration of both Computer Vision and Natural Language Processing techniques. In recent years there has been growing interest in this field, leading to the development of several methods for solving the problem, ranging from early probabilistic models such as Multiple Instance Learning, graph-based approaches such as G-Cap, and Markov Logic Networks, to recurrent models based on RNNs and LSTMs, which suffer from vanishing gradients on longer sequences. These approaches are reviewed in detail in the Related Work section.

Automatically describing the content of images using natural language is a fundamental and challenging task. With the advancement in computing power and the availability of huge datasets, building models that can generate captions for an image has become possible. Humans, on the other hand, can easily describe the environments they are in: given a picture, it is natural for a person to explain an immense amount of detail about the image with a quick glance. Although great progress has been made in computer vision, and tasks such as object recognition, action classification, image classification, attribute classification, and scene recognition are now possible, it is a relatively new task to have a computer describe an image presented to it in the form of a human-like sentence. For image captioning, the semantics of the image must be captured and expressed in natural language. This has a great impact in the real world, for instance by helping visually impaired people better understand the content of images on the web. To build our image caption generator, we merge CNN and Transformer architectures: feature extraction from images is done using a CNN, and the information received from the CNN is then used by the Transformer, i.e. the Transformer encoder and decoder, to generate a description of the image.
Related Work
Initially, Image Captioning was tackled using Recurrent Neural Networks (RNNs) and Long-Short Term Memory
(LSTM) networks. Although these methods generated captions effectively, they faced the issue of vanishing
gradients where the gradients decreased as the model processed longer sequences. In order to overcome this
problem we can use Transformers.

❑ LSTM:
Generated Caption: A cute white and grey cat is sitting on a bench looking outside
❑ TRANSFORMERS:
Generated Caption: A cat with white and grey fur is perched on a brown bench gazing out into the world

Before deep learning approaches to caption generation, there were many probabilistic models, including the Multiple Instance Learning model, which detects objects in the given image and generates object words by applying a CNN. Image captions are used to train the generated words, and the most likely words are chosen from a probability distribution to form the caption. Graph-based image captioning such as G-Cap showed 10% higher accuracy than its predecessors. The idea behind this approach is to represent images, captions, and regions as nodes and link them into a graph. This model did not focus on edge weights, which could have increased captioning accuracy. The Markov Logic Network (MLN) uses probabilistic inference to estimate uncertainty in generated captions: captions are passed through an attention mechanism that adds visual context and are then fed to the MLN. This approach generated captions close to human-generated ones.

To address this challenge, researchers proposed the use of an attention mechanism in conjunction with RNNs or LSTMs, allowing the network to focus on the most important parts of the input and make predictions based on relevant information. There have been multiple variations of attention mechanisms, including soft attention, which focuses on specific parts of the input data when making a prediction; the word "soft" implies that it uses a continuous distribution to weigh the importance of different parts of the input. In contrast, hard attention selects one region at random and focuses only on that particular area. Global attention considers the entire input without giving preference to any specific part, allowing the model to learn from all parts of the input and make predictions based on a holistic understanding of the data. Local attention first determines a position for alignment and then computes attention weights within a window to the left and right of that position, producing a weighted context vector. Adaptive attention with a visual sentinel is a popular attention technique in which the model chooses whether to concentrate on the picture or on the sentinel. This method resolves the problem of earlier attention systems, which had difficulty predicting short words that the decoder cannot ground visually, such as "for" and "and". More recently, the Transformer architecture has emerged as a leading approach for image captioning. This architecture, based on self-attention mechanisms, has proven successful in a variety of NLP tasks, including machine translation and text classification.

Convolutional Neural Networks (CNNs) are a crucial component of many image captioning systems. They help extract meaningful features from images, which can then be used by a language model to generate captions. The steps below outline this pipeline; a minimal decoding sketch follows the list.

1. Input Image: The process starts with the input image, which is represented as a grid of pixels, typically in the form of a 3-dimensional tensor (height x width x channels). The channels represent the color information, usually RGB channels.
2. Pre-trained CNN: In most image captioning models, a pre-trained CNN is used for image feature extraction. Popular choices include VGG, ResNet, Inception, and MobileNet, among others. These CNNs have been pre-trained on large-scale image datasets like ImageNet, which allows them to learn meaningful and general visual features.
3. Convolutional Layers: The preprocessed image is passed through a series of convolutional layers. These layers
consist of filters (also known as kernels) that slide over the image in small patches. Each filter detects different
features in the image, such as edges, textures, or more complex patterns.
4. ReLU Activation: After each convolution operation, a Rectified Linear Unit (ReLU) activation function is applied
element-wise to introduce non-linearity. ReLU replaces negative values with zero and keeps positive values
unchanged.
5. Pooling Layers: After the convolutional layers, pooling layers are often applied. Max-pooling is a common
pooling technique where the image is divided into non-overlapping regions, and the maximum value within each
region is retained. Pooling reduces the spatial dimensions of the feature maps, making them smaller and more
manageable.
6. Flattening: The output of the convolutional and pooling layers is a 3D tensor representing extracted features. To use this information for captioning, the tensor is flattened into a 1D vector.
7. Fully Connected Layers: The flattened vector is then passed through one or more fully connected layers. These layers learn higher-level features and relationships between the features extracted by the previous layers.
8. Feature Vector (Image Embedding): The output of the fully connected layers is the final feature vector. This vector encapsulates the most important visual features of the input image, which can be used as context for generating captions.
9. Caption Generation: The feature vector is fed into a language model, typically a Recurrent Neural Network (RNN) or Transformer-based model. This language model generates captions by predicting the next word in the sequence based on the image features and the previously generated words. The process continues until an end token is generated or a predefined maximum caption length is reached.
10. Training: During training, the model is provided with pairs of images and their corresponding captions. The
loss between the predicted captions and the ground truth captions is calculated, and the model's parameters are
updated to minimize this loss through backpropagation.
11. Inference: During inference (caption generation for new images), the learned CNN features are extracted
from the image, and the language model generates captions based on these features.
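To make steps 9-11 concrete, the following is a minimal greedy-decoding sketch. The interface of `caption_model`, the token names, and the vocabulary dictionaries are hypothetical placeholders rather than part of any specific library.

```python
import numpy as np

def greedy_decode(caption_model, image_features, word_to_id, id_to_word,
                  max_len=30, start_token="<start>", end_token="<end>"):
    """Generate a caption word by word from a precomputed image feature vector.

    `caption_model` is assumed to return, for each position of the partial
    caption, a probability distribution over the vocabulary for the next word.
    """
    caption_ids = [word_to_id[start_token]]
    for _ in range(max_len):
        # Predict the next-word distribution given the image features and the
        # words generated so far (steps 9 and 11 above), then take the argmax.
        probs = caption_model.predict(
            [np.expand_dims(image_features, 0),
             np.expand_dims(np.array(caption_ids), 0)],
            verbose=0)[0, -1]
        next_id = int(np.argmax(probs))
        if id_to_word[next_id] == end_token:   # stop at the end token
            break
        caption_ids.append(next_id)
    # Drop the start token and join the generated words into a caption string.
    return " ".join(id_to_word[i] for i in caption_ids[1:])
```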

Evaluation Methods
A variety of performance metrics exist for evaluating the efficacy of generated captions. In this study, we employ
the BLEU metric to gauge the quality of the captions generated. BLEU evaluates the similarity between the
machine generated text and the reference text through the calculation of n-gram overlap, with higher scores
indicating a closer match. BLEU is a commonly used method in the field of machine translation and has
applications in other natural language generation tasks as well.
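As an illustration, BLEU can be computed with NLTK's reference implementation; the captions below are made up for the example.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a cat is sitting on a wooden bench".split()]   # tokenized ground-truth caption(s)
candidate = "a cat sits on a brown bench".split()             # tokenized model-generated caption

# BLEU-4 with smoothing, so short captions with missing n-grams do not score exactly zero.
score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```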

3.1 METEOR

In the field of image captioning, METEOR is a widely used evaluation metric that measures the correspondence
between words in the generated captions and the reference captions. It typically employs techniques such as
WordNet or the Porter stemmer to perform a one-to-one mapping of the words in the captions. This mapping is used
to compute an F-score, which represents the overall performance of the generated captions. Despite its popularity,
METEOR has seen less use in the last few years, particularly with the rise of deep learning models for natural
language generation (NLG).
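For reference, a METEOR score can be computed with NLTK's implementation. This is a sketch: it assumes the WordNet data has been downloaded, and recent NLTK versions expect pre-tokenized input.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR's synonym matching relies on WordNet

reference = "a cat is sitting on a wooden bench".split()
candidate = "a cat sits on a brown bench".split()

# meteor_score takes a list of tokenized references and one tokenized hypothesis.
score = meteor_score([reference], candidate)
print(f"METEOR: {score:.3f}")
```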

3.2 SPICE

SPICE (Semantic Propositional Image Caption Evaluation) is a metric for evaluating the performance of image captioning models. It measures the overlap between the model-generated captions and the human-annotated captions in terms of recall, precision, and F1-score. Recall measures the proportion of relevant information in the human-annotated captions that is present in the model-generated captions. Precision measures the proportion of the model-generated captions that is relevant to the human-annotated captions. The F1-score is a single score that balances both measures by taking the harmonic mean of recall and precision.

3.3 ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the quality of automatically generated captions by comparing them with ideal captions generated by humans. The ROUGE evaluation package contains multiple measures, including ROUGE-N, ROUGE-W, ROUGE-S, and ROUGE-L.
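A ROUGE score can be computed, for example, with the third-party `rouge-score` package (`pip install rouge-score`); the captions are again made up for illustration.

```python
from rouge_score import rouge_scorer

reference = "a cat is sitting on a wooden bench"
candidate = "a cat sits on a brown bench"

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence), with stemming.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.3f} "
          f"recall={result.recall:.3f} f1={result.fmeasure:.3f}")
```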

4 Dataset
The selection of the dataset significantly impacts the outcome of a deep learning model in image captioning. There are various datasets available for this task, such as the widely used standard benchmark datasets MS COCO, Flickr8k, and Flickr30k, each of which associates multiple captions with each image. This large amount of data requires optimization to reduce latency and data redundancy, and the results obtained from different datasets can vary. A significant number of generic photos are needed in a dataset for image captioning, and preprocessing might also be necessary; subword-based tokenization approaches such as Byte Pair Encoding (BPE) are frequently favoured because they require little preprocessing. The quality of the captions and the frequency of the words play a critical role in determining the quality of the dataset. For instance, if an image has multiple captions each with a unique vocabulary, the model may not be able to accurately predict the words to describe the image. Researchers frequently utilise the Karpathy split because it streamlines the evaluation process, allocating 5000 photos for the test set, 5000 images for the validation set, and the remaining images for training; a small sketch of such a split is shown below.
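The following is a minimal sketch of such a split; the captions file name and its image-to-captions format are hypothetical placeholders for whichever dataset is used.

```python
import json
import random

# Hypothetical captions file: {"image_001.jpg": ["caption one", ...], ...}
with open("captions.json") as f:
    captions_by_image = json.load(f)

images = sorted(captions_by_image)
random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(images)

# Karpathy-style split: 5000 test images, 5000 validation images, the rest for training.
test_images = images[:5000]
val_images = images[5000:10000]
train_images = images[10000:]

print(len(train_images), len(val_images), len(test_images))
```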

Proposed Methodology
We propose to use a Transformer encoder-decoder instead of conventional methods (LSTM); as previously mentioned, the Transformer encoder outperforms the conventional LSTM. In the Transformer encoder, the main methods used to process the input data and create a contextual representation include:
1. Self-Attention Mechanism: The self-attention mechanism is a fundamental component of the Transformer
encoder. It allows the model to compute attention weights for each element in the input sequence based on its
relationship with other elements. In the context of image captioning, the input sequence typically consists of the
image embedding with positional encodings. The self-attention mechanism allows the model to capture
dependencies between different regions of the image, enabling it to attend to relevant visual cues while
processing the image.
2. Multi-Head Attention: To enhance the capacity of capturing different types of dependencies and relationships,
the Transformer encoder employs multi-head attention. It consists of multiple parallel self-attention layers, called
attention heads, each learning a different attention pattern. The output of all the attention heads is then
concatenated and linearly transformed to create the final contextual representation. Multi-head attention
enables the model to focus on multiple aspects of the image simultaneously, providing a more comprehensive
understanding of the visual content.
3. Positional Encoding: Since the Transformer encoder does not inherently capture the sequential order of
elements like RNNs, positional encodings are added to the input sequence. Positional encodings provide spatial
information about the image features, allowing the model to consider the spatial relationships between different
parts of the image. This is essential for understanding the visual layout and context.
4. Feed-Forward Neural Networks: After the self-attention mechanism, the output of the Transformer encoder
passes through feed-forward neural networks, typically consisting of fully connected layers. These networks
process the information further, capturing higher-level patterns and interactions between visual features.
5. Normalization: To stabilize the training process and facilitate gradient flow, layer normalization and residual
connections are often used in the Transformer encoder. Layer normalization normalizes the activations across
the features dimension, and residual connections allow the model to retain important information from the
input.
6. Masking: During training, masking is applied to the self-attention layers of the decoder (not the encoder, which attends freely over all image regions) to prevent the model from attending to future tokens in the caption. This masking ensures that the model generates the caption sequentially, maintaining causality in the generated sequence.
These methods collectively enable the Transformer encoder to process the input image and create a contextual
representation that captures the relationships and dependencies between different regions of the image. The
contextual representation is then used as input to the language generation component (e.g., Transformer
decoder or RNN), which generates the caption word by word, taking into account both the visual information
from the encoder and the previously generated words in the caption.
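To make the components above concrete, here is a minimal sketch of a single Transformer encoder block in Keras, combining multi-head self-attention, a position-wise feed-forward network, residual connections, and layer normalization; the dimensions are illustrative defaults rather than the values of any particular experiment.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(embed_dim=256, num_heads=4, ff_dim=512, dropout=0.1):
    """One Transformer encoder block operating on a sequence of image-region embeddings."""
    inputs = layers.Input(shape=(None, embed_dim))

    # Multi-head self-attention over the image regions (queries = keys = values = inputs).
    attn_out = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim // num_heads)(inputs, inputs)
    attn_out = layers.Dropout(dropout)(attn_out)
    x = layers.LayerNormalization(epsilon=1e-6)(inputs + attn_out)   # residual + layer norm

    # Position-wise feed-forward network.
    ff_out = layers.Dense(ff_dim, activation="relu")(x)
    ff_out = layers.Dense(embed_dim)(ff_out)
    ff_out = layers.Dropout(dropout)(ff_out)
    outputs = layers.LayerNormalization(epsilon=1e-6)(x + ff_out)    # residual + layer norm

    return tf.keras.Model(inputs, outputs, name="transformer_encoder_block")

block = encoder_block()
block.summary()
```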

5.1 Model Architecture


Researchers have been keen on finding the best image captioning models. Various CNN models have been introduced to accomplish this task, such as VGG16, VGG19, Inception V3, and ResNet50. Most of these architectures consist of parallel convolutional layers stacked upon each other. Among them, Inception V3 has been one of the most popular choices, mainly because it is more efficient, faster, and computationally less expensive. It consists of 42 convolutional layers and has a lower error rate than its predecessors.

Inception V3 - The Inception deep convolution approach was first presented as GoogLeNet. This architecture factorizes convolutions by replacing larger filters with smaller one-dimensional filters. The main feature of this architecture is that it concatenates the outputs of multiple different-sized convolutional filters within one module. This reduces computational complexity because fewer parameters have to be trained. In our model, the image is first passed to InceptionV3, which extracts its features and saves them in a NumPy file (sketched below); once features are extracted, they are passed to the Transformer or LSTM network for the caption generation task.
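As a sketch of this step, the code below uses Keras' pre-trained InceptionV3 with the classification head removed and global average pooling to turn an image into a 2048-dimensional feature vector and cache it as a NumPy file; the file paths are placeholders.

```python
import numpy as np
import tensorflow as tf

# InceptionV3 without its classification head; global average pooling yields a 2048-d vector.
feature_extractor = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_path):
    # InceptionV3 expects 299x299 RGB inputs scaled to [-1, 1].
    img = tf.keras.utils.load_img(image_path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return feature_extractor.predict(np.expand_dims(x, 0), verbose=0)[0]

features = extract_features("example.jpg")     # placeholder path
np.save("example_features.npy", features)      # cached for the captioning model
print(features.shape)                          # (2048,)
```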

Transformers - The Transformer addresses the vanishing gradient problem that occurs in traditional RNN and LSTM models when processing lengthy text inputs. The attention mechanism, which calculates the importance of each input token, is utilised to overcome this problem. The Encoder and Decoder blocks make up the Transformer, a fully attentive architecture. These blocks are made up of numerous feed-forward, positional encoding, and multi-head attention layers. The input, which is produced by combining the input embedding and positional encoding, is sent into the Encoder, which then creates an attended representation with contextual information for each word in the input sentence. A masked multi-head attention layer in the Decoder, in addition to the layers in the Encoder, makes sure that future words are not taken into account when making predictions (a small sketch of such a mask follows). The output of the Encoder serves as the input for the multi-head attention layer in the Decoder, which generates one word at a time and attends to previously generated words.
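The look-ahead mask used by the Decoder's masked multi-head attention can be built as a simple lower-triangular matrix, as in the sketch below.

```python
import tensorflow as tf

def look_ahead_mask(seq_len):
    """Return a (seq_len, seq_len) mask where position i may attend only to positions <= i."""
    # Lower-triangular matrix of ones: 1 = attend, 0 = blocked (future token).
    return tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

print(look_ahead_mask(4).numpy())
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```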

Fig. 2. Transformer model architecture

Multihead Attention - It allows the model to attend to different positions in the input sequence, in parallel,
allowing for greater representation power. In Multi Head Attention, the input sequence is first linearly projected
into multiple heads, with each head having its own set of weight matrices. These projections are then processed
using scaled dot-product attention to compute attention scores between each query and each key. These scores
are then used to compute a weighted sum over the values, which represent the context vectors. Finally, the
outputs of each head are concatenated and linearly transformed back to the original representation space. A
stronger representation of the input is produced because of the Multi-Head Attention mechanism, which allows
the Transformer to process relationships between several points in the input sequence simultaneously and pay
attention to the most crucial information in each head.
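The scaled dot-product attention computed inside each head can be written in a few lines of NumPy; this is an illustrative sketch rather than the exact implementation used in our model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the key dimension
    return weights @ V, weights                        # context vectors and attention weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 64)), rng.normal(size=(7, 64)), rng.normal(size=(7, 64))
context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)    # (5, 64) (5, 7)
```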
Fig. 3. Multi-head attention
LSTM - Researchers frequently employ LSTM (Long Short-Term Memory) networks in text translation, audio-to-text conversion, and other applications. Its core architecture is comparable to that of RNNs, but the design of its repeating module is different. It has a structure called the cell state which preserves selected recent information. The information to be preserved passes through structures called gates.

Gates decide which part of the information should be stored and which part should be discarded. These gates are built from sigmoid functions whose outputs lie between 0 and 1: information is passed on when the gate value is close to 1 and discarded when it is close to 0.

Architecture:
1.CNN for Image Feature Extraction: The input image is passed through a pre-trained CNN to extract meaningful
visual features, generating a fixed-size image embedding (It refers to a meaningful representation of an image in
a numerical form that captures the most relevant visual information and a dense vector that encodes the high-
level features and characteristics of the image) that represents the salient information in the image.
2.Transformer Encoder: The image embedding is then fed into a Transformer encoder, which consists of multiple
self-attention layers and feed-forward neural networks. The Transformer encoder processes the image
embedding to create a contextual representation, capturing relationships between different regions of the
image.
3.Positional Encodings: Positional encodings are added to the output of the Transformer encoder to provide
spatial information about the image features. These encodings enable the Transformer decoder to consider the
arrangement of visual information in the image.
4.Transformer Decoder: The image embedding with positional encodings is used as input to a Transformer
decoder specialized for natural language generation. The Transformer decoder consists of self-attention layers
and feed-forward neural networks. It generates the caption word by word in a sequential manner, attending to
relevant visual cues from the image embedding and considering the previously generated words in the caption.
5.Training: The entire model, including the CNN, Transformer encoder, and Transformer decoder, is trained end-to-end on a large dataset containing paired images and their corresponding captions. The training process optimizes the model's parameters to minimize the difference between the generated captions and the ground-truth captions using an appropriate loss function, such as cross-entropy loss; a minimal training-step sketch follows this list.
6.Inference: After the model is trained, it can be used for inference on new, unseen images. During inference,
the image is passed through the CNN to obtain its feature representation, which is then processed by the
Transformer encoder. The Transformer decoder generates the caption word by word, taking into account both
the visual information from the encoder and the previously generated words from the caption. The generation
process continues until an end-of-sequence token or a predefined maximum caption length is reached.
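As a rough illustration of step 5, a single training step with teacher forcing and a padding-masked cross-entropy loss might look like the sketch below; `caption_model` is a hypothetical Keras model that takes image features and shifted caption tokens and returns per-position vocabulary logits, and the use of padding id 0 is an assumption.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none")
optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(caption_model, image_features, caption_ids):
    # Teacher forcing: the decoder sees tokens [0..T-1] and must predict tokens [1..T].
    decoder_input = caption_ids[:, :-1]
    target = caption_ids[:, 1:]
    with tf.GradientTape() as tape:
        logits = caption_model([image_features, decoder_input], training=True)
        per_token_loss = loss_fn(target, logits)
        mask = tf.cast(tf.not_equal(target, 0), per_token_loss.dtype)  # ignore padding (id 0)
        loss = tf.reduce_sum(per_token_loss * mask) / tf.reduce_sum(mask)
    grads = tape.gradient(loss, caption_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, caption_model.trainable_variables))
    return loss
```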

Conclusion
In the realm of computer vision and natural language processing, the development of an "Image Caption
Generator using Transformers" represents a remarkable fusion of cutting-edge technologies that promises to
transform the way we interact with visual content. Throughout this project, we embarked on a journey to harness
the power of transformers, a transformative neural network architecture, to tackle the challenging task of
automatically generating descriptive and contextually meaningful captions for images. This pursuit was
motivated by the need to enhance human-computer interaction, accessibility for the visually impaired, and
numerous applications in content retrieval and recommendation systems.

Using Transformers for image caption generation has revolutionized the field by enhancing the quality of
generated captions. These models, known for their superior sequence-to-sequence modeling and cross-modal
understanding, have shown remarkable capability in translating visual content into coherent and contextually
relevant textual descriptions. Leveraging pre-trained Transformers and fine-tuning them for image captioning
tasks offers a cost-effective approach, while their adaptability makes them versatile for a wide range of
applications, from object recognition to creative captioning. Some models even merge vision and language
processing within a single architecture, enabling more holistic and integrated image captioning. As research in
this domain continues, we can expect ongoing improvements in the accuracy and interpretability of captions,
reinforcing the importance of Transformers in modern image captioning systems.
