
Design Specification for an Image Captioning Model

Humberto Lozano Cedillo A01363184


Julio Cesar Lynn Jimenez A01793660
Sarah Mendoza Medina A01215352
David Mireles Samaniego A01302935

The following pages present a design specification for an image captioning model, covering its architecture, training methodology, and evaluation metrics.

Content:
1. Model Architecture
● Image Processing Component (CNN)
● Text Generation Component (RNN / LSTM)
● Integration
2. Training Specifications
● Data Preprocessing
● Training Regimen
● Optimizer and Learning Rate
3. Performance Metrics
4. Additional Considerations
5. References

1. Model Architecture:

● Image Processing Component (CNN)

Architecture: The proposed architecture uses a convolutional neural network (CNN) such as ResNet-50, recognized for its deep layer structure and its effectiveness in image recognition tasks. The model comes pretrained on large datasets such as ImageNet, providing a robust and advanced starting point for feature extraction.

Layers and Activation Functions: The CNN consists of a sequence of convolutional layers paired with Rectified Linear Unit (ReLU) activation functions. These layers capture a spectrum of image aspects, from basic edges to intricate features. Pooling layers follow the convolutional layers to reduce spatial dimensions and computational load, improving the model's efficiency.

Regularization Techniques: To mitigate overfitting, dropout layers will be strategically placed within the network. In addition, batch normalization will be applied after each convolutional layer. This dual approach not only helps prevent overfitting but also accelerates training and stabilizes the model's learning dynamics.
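
As an illustration of this component, the following is a minimal PyTorch sketch of the image encoder, assuming a recent torchvision release with pretrained ResNet-50 weights; the projected feature size and dropout rate are illustrative choices rather than fixed requirements.

import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class ImageEncoder(nn.Module):
    """ResNet-50 backbone that turns an image into a fixed-size feature vector."""

    def __init__(self, feature_dim=300, dropout=0.5):
        super().__init__()
        # Pretrained ResNet-50 (ImageNet weights) with its classification head removed;
        # only the convolutional trunk and global average pooling are kept.
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        self.trunk = nn.Sequential(*list(backbone.children())[:-1])
        # Project the 2048-d pooled features to the size expected by the decoder
        # (assumed here to equal the word-embedding dimension).
        self.project = nn.Linear(backbone.fc.in_features, feature_dim)
        self.bn = nn.BatchNorm1d(feature_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, images):                       # images: (batch, 3, 224, 224)
        features = self.trunk(images).flatten(1)     # (batch, 2048)
        return self.dropout(self.bn(self.project(features)))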

● Text Generation Component (RNN / LSTM)

Architecture: The text generation component employs a Long Short-Term Memory (LSTM) network, chosen for its effectiveness in handling sequential data and its ability to retain long-term dependencies. This makes it particularly well suited to generating coherent captions that align with the context of the input.

Details: The LSTM network will be composed of 2 to 3 layers, each with 512 units. This configuration is complex enough to capture the intricacies of language while avoiding a size that would invite overfitting.

Parameters: An embedding layer will be incorporated to transform input words into dense vectors of fixed size, a format well suited to learning semantic relationships and contextual nuances. After the LSTM layers, a fully connected layer generates the probability distribution over the next word in the sequence, completing the caption generation process.
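
A corresponding sketch of the text generation component is shown below. The two stacked layers of 512 units follow the configuration above; the embedding dimension of 300, the placeholder vocabulary size, and the choice to prepend the image feature vector as the first LSTM input (detailed in the Integration subsection) are assumptions made for illustration.

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Embedding layer -> stacked LSTM -> fully connected layer over the vocabulary."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, num_layers=2, dropout=0.5):
        super().__init__()
        # Converts word indices into dense vectors of fixed size.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Stacked LSTM; dropout is applied between layers when num_layers > 1.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            batch_first=True, dropout=dropout)
        # Produces logits (an unnormalized distribution) over the next word.
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, captions):
        embeddings = self.embedding(captions)                     # (batch, seq_len, embed_dim)
        # Prepend the image feature vector as the first step of the sequence.
        inputs = torch.cat([image_features.unsqueeze(1), embeddings], dim=1)
        outputs, _ = self.lstm(inputs)                            # (batch, seq_len + 1, hidden_dim)
        return self.fc(outputs)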

● Integration

The CNN and LSTM components are combined in this architecture: the feature vector derived from the CNN serves as the starting point for the LSTM, providing a robust foundation for caption generation. This integration lets the model exploit visual features and sequential dependencies in tandem.

The process begins with feature extraction from the input image by the ResNet-50 model: the image passes through convolutional layers, residual blocks, and activation functions to distill its key features. These features are then processed by the LSTM in the Text Generation Component; in the baseline configuration this is a single-layer LSTM with a hidden state size of 256 and an embedding size of 300, which models the caption sequence conditioned on the features. Finally, the ResNet-50 and LSTM components are combined to produce the output: a caption that reflects the content of the input image, joining visual detail with sequential understanding.

Input image → ResNet-50 (Image Processing Component, CNN) → Convolutional layers → Residual Blocks → ReLU Activation → Dropout (0.5) → Image Features
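
To make this data flow concrete, the sketch below wires together the encoder and decoder classes sketched earlier; the vocabulary size, batch size, and dummy tensors are hypothetical and only show how the pieces fit.

import torch

# Hypothetical sizes used only for this wiring example.
VOCAB_SIZE = 10000

encoder = ImageEncoder(feature_dim=300)            # image features projected to the embedding size
decoder = CaptionDecoder(vocab_size=VOCAB_SIZE, embed_dim=300, hidden_dim=512, num_layers=2)

images = torch.randn(8, 3, 224, 224)               # a dummy batch of preprocessed images
captions = torch.randint(0, VOCAB_SIZE, (8, 20))   # dummy token indices (batch, seq_len)

features = encoder(images)                         # (8, 300) image feature vectors
logits = decoder(features, captions)               # (8, 21, VOCAB_SIZE) next-word distributions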

2. Training Specifications:
● Data Preprocessing:

The selected datasets for training and validating the model are Flickr8k, Flickr30k, and MS COCO, all of which are public. If the model is to be used for a task in a particular field, additional fine-tuning can be planned with a private dataset; the main challenge of that route is labeling the private images, or annotating them, as the task is usually called in image captioning work.
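
As a sketch of how one of these public datasets could be loaded, the snippet below uses torchvision's CocoCaptions wrapper; the local paths are hypothetical, and the MS COCO images and annotation files must be downloaded separately.

from torchvision import transforms
from torchvision.datasets import CocoCaptions

# Hypothetical local paths to the MS COCO images and caption annotations.
train_set = CocoCaptions(
    root="data/coco/train2014",
    annFile="data/coco/annotations/captions_train2014.json",
    transform=transforms.ToTensor(),
)

image, captions = train_set[0]   # each image comes with several reference captions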

● Training Regimen: training epochs, batch size, and data augmentation strategies.

To arrive at a training procedure, we first need to establish a baseline with a simple model and run some exploration, which may include a grid search over the hyperparameters and early stopping to ensure the model keeps improving without overfitting.

As a starting point, empirical evidence suggests that between 10 and 20 epochs is enough for this exploration phase. Batch size can also be explored, since smaller batches update the parameters more frequently; a batch size of 64 can be used, as in the Show, Attend and Tell paper (Xu et al., 2015).

The image preprocessing pipeline will include resizing to 224×224 pixels and normalization with the ImageNet statistics μ = [0.485, 0.456, 0.406] and σ = [0.229, 0.224, 0.225]. Additional data augmentation based on randomly applied transformations can make the model more robust; examples include random resized crops, horizontal and vertical flips, zooming in and out, perspective changes, and blur.
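
A possible implementation of this pipeline with torchvision transforms is sketched below; the exact set and strength of the random augmentations are tunable assumptions, not fixed requirements.

from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training pipeline: resize/crop to 224x224 plus a few random augmentations.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

# Validation pipeline: deterministic resize, no augmentation.
val_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])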

● Optimizer and Learning Rate: optimizer choice and learning rate strategy.

In a similar fashion to the grid search described above, we can treat the learning rate as a hyperparameter to be tuned, or use adaptive learning rate algorithms to find the best results. An initial range of 1×10⁻⁴ to 1×10⁻³ can help get started, although the paper by Selivanov et al. (2023) on medical image captioning goes as low as 3×10⁻⁷ for its encoder-decoder.

Optimizer comparisons in the referenced literature show Adam with the best accuracy, followed closely by AdamW and RMSprop. If the desired performance is not achieved, other optimizers can be explored to differentiate this work from related approaches.
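
An illustrative optimizer setup consistent with this discussion is shown below: Adam with a starting learning rate inside the 1×10⁻⁴ to 1×10⁻³ range, combined with a scheduler that reduces the rate when the validation loss plateaus. The specific values and the reuse of the encoder/decoder objects from the earlier sketches are assumptions.

import torch

# Jointly optimize the encoder and decoder parameters.
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)

# Halve the learning rate if the validation loss does not improve for 2 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)

# After each validation pass: scheduler.step(val_loss)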

3. Performance Metrics:
Model performance can be evaluated in several ways, from human judgment to standard benchmarks. For this design, the most suitable metrics are quantitative ones such as CIDEr (Consensus-based Image Description Evaluation) and SPICE (Semantic Propositional Image Caption Evaluation), since they provide measurable criteria and help avoid bias in the evaluation. Both metrics are specialized for image captioning.

CIDEr is designed specifically for image captioning evaluation. It measures the similarity of generated captions to a set of reference captions, taking human consensus into account and penalizing generic descriptions.
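
As a hedged illustration, the snippet below computes CIDEr with the pycocoevalcap package (the MS COCO caption evaluation toolkit); the image id and captions are made up, and the dictionaries follow the toolkit's expected format of caption lists keyed by image id.

from pycocoevalcap.cider.cider import Cider

# Reference captions and a single generated caption for one hypothetical image.
gts = {"img1": ["a brown dog chases a red ball",
                "a dog is running after a ball in the grass"]}
res = {"img1": ["a dog chasing a ball on the grass"]}

score, per_image_scores = Cider().compute_score(gts, res)
print(f"CIDEr: {score:.3f}")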

SPICE focuses on the semantic content of the captions. It evaluates the fidelity of the generated
captions in terms of scene graph matching, which represents the objects, attributes, and relationships
in the image. Given a set of object classes C, a set of relation types R, a set of attribute types A, and a caption c, the scene graph of c is defined as:

G(c) = ⟨O(c), E(c), K(c)⟩

where O(c) ⊆ C is the set of object mentions in c, E(c) ⊆ O(c) × R × O(c) is the set of hyper-edges representing relations between objects, and K(c) ⊆ O(c) × A is the set of attributes associated with objects.
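
For example, for the hypothetical caption c = "a brown dog chases a ball", the parsed scene graph would contain O(c) = {dog, ball}, E(c) = {(dog, chases, ball)}, and K(c) = {(dog, brown)}; SPICE then scores the overlap between the tuples of this graph and those of the reference captions' graphs as an F-score.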

4. Additional Considerations:
Image captioning, which combines computer vision and natural language processing, faces significant challenges, and different strategies exist to address them. One of the primary challenges is the accuracy of object recognition and detection. To address it, we rely first on selecting a suitable dataset that facilitates this process.

Understanding the context and relationships between objects in an image is essential for generating meaningful captions. Techniques such as scene graphs and attention mechanisms help the model focus on relevant information within the image, enhancing context understanding.

Handling ambiguity and subjectivity in image interpretation is challenging, since different viewers may interpret the same image differently. Training on datasets with multiple captions per image, and encouraging the model to produce multiple captions for a single image, can help it learn diverse perspectives.

Dealing with rare and unseen objects in images is a significant challenge. Techniques such as few-shot or zero-shot learning enable models to generalize to new objects, and the training data can be augmented with methods such as generative adversarial networks to include examples of rare images.

Balancing computational efficiency is important, especially for models used in real-time applications. This challenge can be addressed by optimizing neural network architectures for efficiency and using model compression techniques.

Addressing bias and ethical concerns involves using diverse and representative datasets and applying fairness and bias evaluation metrics during training and testing to prevent the perpetuation of biases present in the training data.

These are only some of the many challenges that image captioning entails. Addressing them requires a blend of improved model architectures, sophisticated training techniques, better datasets, and a focus on ethical AI practices.

5. References:
Castro, R., Pineda, I., Lim, W., & Morocho-Cayamcela, M. E. (2022). Deep Learning Approaches Based on
Transformer Architectures for Image Captioning Tasks [Article].

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., ... & Bengio, Y. (2015, June). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (pp. 2048-2057). PMLR.

Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 (pp. 740-755). Springer International Publishing.

Selivanov, A., Rogov, O. Y., Chesakov, D., Shelmanov, A., Fedulova, I., & Dylov, D. V. (2023). Medical image
captioning via generative pretrained transformers. Scientific Reports, 13(1), 4171.

Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic Propositional Image Caption
Evaluation. Springer.
https://researchers.mq.edu.au/en/publications/spice-semantic-propositional-image-caption-evaluation
