You are on page 1of 60

Expert Systems With Applications 221 (2023) 119773

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

Review

Evolution of visual data captioning Methods, Datasets, and evaluation


Metrics: A comprehensive survey
Dhruv Sharma, Chhavi Dhiman *, Dinesh Kumar
Department of Electronics and Communication Engineering, Delhi Technological University, Delhi, India

A R T I C L E I N F O A B S T R A C T

Keywords: Automatic Visual Captioning (AVC) generates syntactically and semantically correct sentences by describing
Visual Captioning important objects, attributes, and their relationships with each other. It is classified into two categories: image
Image Captioning captioning and video captioning. It is widely used in various applications such as assistance for the visually
Video Captioning
impaired, human-robot interaction, video surveillance systems, scene understanding, etc. With the unprece­
Change Image Captioning (CIC)
dented success of deep-learning in Computer Vision and Natural Language Processing, the past few years have
LSTM
CNN seen a surge of research in this domain. In this survey, the state-of-the-art is classified based on how they
RNN conceptualize the captioning problem, viz., traditional approaches that cast visual description either as retrieval
Computer Vision (CV) or template-based description and deep learning approaches. A detailed review of existing methods, highlighting
Natural Language Processing (NLP) their pros and cons, societal impact as the number of citations, architectures used, datasets experimented on and
GitHub link is presented. Moreover, the survey also provides an overview of the benchmark image and video
datasets and the evaluation measures that have been developed to assess the quality of machine-generated
captions. It is observed that dense or paragraph generation and Change Image Captioning (CIC) are stimu­
lating the research community more due to the near-to-human abstraction ability. Finally, the paper explores
future directions in the area of automatic visual caption generation.

1. Introduction systems, health care systems (Pavlopoulos, Kougia, & Androutsopo,


2019), scene understanding (Cordts, et al., 2016). Although many
Nowadays, it is easy to generate and collect visual data which possess classical computer-vision solutions (Long, Shelhamer, & Darrell., 2015)
copious information for addressing real-world problems such as (Ren, He, Girshick, & Sun, 2015) for object classification or detection
healthcare (Liu, Peng, & Rosen, 2019), public surveillance Xu et al., have shown promising results. They usually generate partial and un­
(2017, January), sports analysis (Yang W. , 2019), anomaly detection structured outputs, such as bounding boxes and object labels in a video
(Bergman & Hoshen, 2020), crowd analysis (Xu et al., 2017, January) frame. The obtained semantic primitives can be utilized for caption
(Yang W. , 2019). It has led to easy accessibility of images and videos. generation for images/videos. Whereas, Natural Language Processing
Hence, an automatic and intelligent visual understanding and content (NLP) can be used to describe these visual observations as sentences,
summarization have emerged as a paramount interest (Bernardi, et al., which are much easier for understanding.
2016) (Singh, Doren, & Bandyo, 2020) The research community (Far­ Fundamental steps involved in Automatic Visual Caption generation
hadi, et al., 2010) (Kojima, Izumi, Tamura, & Fukunaga, 2000) is can be broadly defined as i) visual understanding of image and video,
working towards a smart visual understanding of the images and videos. and ii) language generation. Visual understanding of image and video is
However, there exists a large semantic gap between low-level and high- an entrenched and yet so emerging field of computer vision that has
level abstract knowledge of visual data. Caption Generation can serve as been substantially researched. It helps researchers know and mine the
a good solution to bridge the semantic gaps between low-level and high- visual information available in the frames which can be used for clas­
level abstract knowledge of the visual data and serve various real-world sification (He, Zhang, Ren, & Sun, Deep residual learning for image
applications i.e., video surveillance systems (Nivedita et al., 2021, recognition, 2016), detection (Ren, He, Girshick, & Sun, 2015), seg­
March) visual recognition (Redmon & Farahadi, 2018), visual assistive mentation (Long, Shelhamer, & Darrell., 2015), captioning (Farhadi,

* Corresponding author.
E-mail addresses: dhruv.0906@yahoo.in (D. Sharma), chhavi.dhiman@dtu.ac.in (C. Dhiman), dineshkumar@dtu.ac.in (D. Kumar).

https://doi.org/10.1016/j.eswa.2023.119773
Received 10 March 2022; Received in revised form 26 February 2023; Accepted 26 February 2023
Available online 3 March 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

et al., 2010) (Kojima, Izumi, Tamura, & Fukunaga, 2000), etc. Whereas, another interesting application where a natural language dialogue be­
language generation helps convert significant visual information in the tween a guide and a tourist helps the tourist to reach a previously unseen
form of language, to finally generate captions. AVC can be further location on a map using perception, action, and interaction modeling.
categorized as: Advanced deep video understanding is further, encouraged by the
a) Image Captioning: Image captioning is a challenging task that availability of a large number of annotated videos. Video classification
describes the visual content of an image in a natural language and (Wu, Yao, Fu, & Jiang, 2016) and video captioning (Jin & Liang, 2016)
provides automated insights into images giving answers to questions like are mainly two broad research areas on the comprehension of videos.
where you are? (Beach, cafe, etc.), what do you wear? (color), and more The former focuses on automatically docketing video clips based on
importantly what you are doing? It recognizes the objects, their attri­ their contents and frames like complex human events and actions, while
butes, and their relationships in an image and generates syntactically the later generates a complete and natural sentence that enriches video
and semantically correct sentences. With the advancements of neural classification and captures the most informative dynamics in videos. The
networks, image captioning has gained immense popularity which helps output generated from video classification is video categories predicted
in generating human-like descriptions according to the input image. It by the classification model whereas the video captioning model is the
has a promising future to facilitate the intelligence of mankind and can predicted description of the video in the form of natural language.
serve as a helpful tool for visually impaired people. Many other appli­
cations (Weiss et al., 2019) can be developed in this direction such as 1.1. Challenges of Visual Captioning
finding the expiration date of a specific food item or knowing about the
weather by taking a picture. The process of image captioning can be Human beings can easily recognize their surroundings and can
defined in three fundamental steps as i) Object Detection ii) Attributes/ describe any image or video scene in their natural language but in the
Feature Extraction, followed by iii) Sentence Generation. Initially, after case of machines, it is very difficult to generate human-like descriptions
having detected the object/image, its features/attributes of the given of images and videos. However, machines can recognize various human
input image i.e., color, objects, boundaries, and texture details are activities from video frames and images to a certain extent, but the
extracted, encrypted, and translated as an appropriate description. automatic description of visual scenes for complex and long-term human
There exist various challenges in the field of caption generation of visual activities is still a challenging task. From a linguistic perspective, ac­
data. Researchers in today’s world have designed computer vision- tivity recognition is all about extracting semantic similarity among
enabled captioning models which can describe “what” (e.g., classifica­ human actions represented by verb phrases and transforming visual
tion (He, Zhang, Ren, & Sun, Deep residual learning for image recog­ information into semantic text. It is analogous to grounding words in
nition, 2016), segmentation (Long, Shelhamer, & Darrell., 2015)) and perception and action.
“where” (e.g., detection (Ren, He, Girshick, & Sun, 2015), tracking This survey has mainly focused on the textual description of images
(Kristan, et al., 2015)). However, it is bad at knowing “why”, e.g., why is and videos. It has been observed that more attention is required on the
it a girl? Note that the “why” here does not merely mean asking for generation of attractive and detailed descriptions for images and videos
visual reasons — attributes like two hands, two legs, hair, that are contents automatically. In the field of CV, we are interested in con­
already well-addressed by machines; beyond, it also means asking for structing and learning models which can characterize images or videos
high-level common-sense reasons — such as a girl is climbing wooden by recognizing their categories or other high-level features. In the field
stairs — that are still elusive, even for human philosophers, not to of NLP, we usually encounter the inverse challenges to parse a language
mention for machines. Further image captioning models should be description by identifying connotations and denotation of the sequence
designed such that the model can define the caption semantics as clearly of words. These challenges arise because languages are directly related
as possible by describing multiple target objects-“bucket with green lid”, concepts rather than the lossless recording of objects or activities in the
and “white flowers”, instead of just describing a single target object-“girl real world. The major challenges to visual captioning are as follows:
in a pink dress”. To sum up, in its current art, image captioning has a) Compositionality and naturalness of natural language and visual
gained steady headway and produced brusque and generic vivid cap­ scenes: Traditional techniques do not acknowledge and recognize min­
tions, with the introduction of Convolutional Neural Network (CNN) ute details of images and videos. Therefore, the interaction of objects is
(Szegedy & Liu, 2014; Vinyals, Toshev, Bengio, & Erhan, 2015) and an onerous task thus making the traditional techniques suffer from a lack
Recurrent Neural networks (RNN) (Karpathy & Fei-Fei, Deep Visual- of compositionality and naturalness. The biggest challenge here is the
Semantic Alignments for Generating Image Descriptions, 2015) since subtleness of the action units. Sometimes they are either not visible, or
2015. For this to grow fully and become an assistive technology, an are hard for vision techniques to detect. For instance, unclear unit
archetype is required that shifts towards goal-oriented captions; where boundaries and occlusions of interactive objects present other diffi­
the caption not only describes a scene from day-to-day life but also culties to accurately decode the intention of the human activities in a
answers a specific need that is helpful in many applications. video. In some of the works, attention-based models (Pedersoli, Lucas,
b) Video Captioning: Videos, in particular, have become a-minute Schmid, & Verbeek, 2017; Long et al., 2016, December) are designed to
way of communication between internet users with the growth of mobile address this issue.
devices. With the prodigious increase in storage space and bandwidth, b) Intermediate representation learning: Learning mid-level repre­
video data has been generated, published, and spread strenuously, sentations between the visual domain and natural language domain is a
becoming an indispensable part of today’s big data. It has encouraged key problem in visual-to-text techniques. Therefore, high-level visual
the development of advanced techniques for a broad range of video features are the need of art to represent the visual data completely.
understanding applications including online advertising (Hussain, et al., c) Recounting of visual contents: There exist state-of-the-art (Kar­
2017), video retrieval (Alayrac, Bojanowski, Agrawal, Sivic, & Lacoste- pathy & Fei-Fei, Deep Visual-Semantic Alignments for Generating Image
Julien, 2016), video surveillance (Li, Zhang, Yu, Huang, & Tan, 2019), Descriptions, 2015; Yao et al., 2017) that recognize semantic elements in
human-robot interaction (Vries, Shuster, Batra, Weston, & Kiela, 2018), the visual data, still, fail to rank in order in accordance with the theme of
movie description (Brand, 1997), assistance to visually impaired (Jin & the image or video that may guide to generate more relevant textual
Liang, 2016). The advancements in the direction of video captioning descriptions. Also, we need to find out how much detail we are looking
open up enormous opportunities in various application domains. It is to recount and what type of language complexity is to be applied.
envisaged that in the near future, we would be able to interact with d) Benchmark datasets with moneyed text: To automatically eval­
robots in the same manner as with humans (Rohrbach et al., 2012; uate language descriptions for visual contents, we need standard data­
Rohrbach, et al., 2013). The recent release of a dialogue dataset, Talk the sets (Rohrbach, Amin, Andriluka, & Schiele, 2012) (Miech, et al., 2019)
Walk (Vries, Shuster, Batra, Weston, & Kiela, 2018), has introduced yet for evaluating new methods and algorithms. Sentence-level annotations

2
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

that are aligned to the image and video are basic requirements. For The organization of the presented survey is shown in Fig. 1. In sec­
corpus descriptions of different languages, a general image or video tion 2, an analysis of the existing recent works is provided. Section 3
description system capable of handling multiple languages should be discusses visual captioning methods which include different image
developed. captioning and video captioning techniques. Section 4 presents
e) Evaluation of quality of captions generated: With the use of commonly used datasets and different evaluation metrics for image and
automated metrics (Vedantam et al., 2015; Banerjee & Lavie, 2005), we video captioning. Finally, section 5 covers the conclusion and future
can partially evaluate the quality of the captions generated. In some work.
cases, this evaluation remains inadequate and sometimes even
misleading. The best way to evaluate the quality of automatically 2. Recent Works
generated texts is a subjective assessment by linguists, which is hard to
achieve. To improve system performance, the evaluation indicators This section discusses and analyses the literature reported in recent
should be optimized to make them more in line with human experts’ surveys. A compare and contrast of earlier works (Bernardi, et al., 2016)
assessments. (Singh, Doren, & Bandyo, 2020) (Li et al., 2019, August) (Amirian,
The semantic gap between low-level and high-level knowledge Rsheed, Taha, & Arabnia, 2020) (Bai & An, 2018) (Liu et al., 2018)
abstraction, in CV and AI, has been of major concern for a long. The (Staniute & Šešok, 2019) (Kumar and Goel, 2017, November) (Hossain
bridging of this gap invites attention, the incorporation of which may et al., 2018) Aafaq et al., (2019, October). (Qi, 2018) (Islam, et al., 2021)
yield more common sense and reasoning into the scene understanding. (Martin, et al., 2021) with this survey is highlighted. The works (Ber­
Also, the techniques for visual captioning should be able to leverage nardi, et al., 2016) (Bai & An, 2018) (Liu et al., 2018) (Staniute & Šešok,
more flexible semantic units, meaning thereby that various combina­ 2019) (Kumar and Goel, 2017, November) (Hossain et al., 2018) dis­
tions of nouns, verbs, and other language units be widely explored. cussed specifically the evolution of image captioning models from
Further, the improvement in visual captioning scene understanding will traditional approaches to deep approaches. (Bernardi, et al., 2016)
make the CV system more reliable for use as an aid-to-blind, visual analyze image captioning approaches based on generation or retrieval
question answering, google image search, etc. problems over a multimodal or visual representation space and provided
an overview of benchmark image datasets: MSCOCO, Flickr8K,
Flickr30K, and various evaluation metrics which are developed to assess
1.2. Major Contributions the quality of machine-generated captions. Kumar and Goel, (2017,
November). carried out an analysis of image captioning approaches in
This paper provides a comprehensive review of the existing literature terms of speed and accuracy. In the year 2018, (Bai & An, 2018) and (Liu
on visual (image/video) captioning methods that evolved from tradi­ et al., 2018) presented a survey on advances in image captioning. These
tional to deep-learning models, with emphasis on the deep-learning surveys discussed the traditional methods for image caption generation,
ones. The survey intends to intensively analyze publicly available followed by an end-to-end framework for image captioning. (Bai & An,
datasets for image and video captioning followed by a comparative 2018) focused mainly on neural network-based models which are
analysis of the performance of popular methods for each dataset. It will divided into six sub-categories: encoder-decoder framework, attention-
help readers identify the key state-of-the-art establishing superior re­ guided, multimodal-based learning, compositional architectures, generating
sults. The paper covers the widely used evaluation metrics and discusses descriptions for images with novelties, retrieval, and template-based methods.
the pros and cons of each metric. The significant contributions of this While on the other hand, Liu et al. (2018) reviewed Dense-captioning
survey are summarized below: based methods to generate automatic captions for images. A System­
atic Literature Review (SLR) (Staniute & Šešok, 2019) presented a brief
(1) Outlines a taxonomy of image captioning (Fig. 7) and video overview of improvements in captioning of images over the years and
captioning (Fig. 20) methods evolved from traditional to deep- the biggest challenges in image captioning. Whereas, Hossain et al.
learning approaches. (2018) presented a comprehensive survey of deep-learning techniques
(2) Various techniques of deep-learning techniques for image and for image captioning supporting the fact that deep-learning techniques
video captioning are discussed in detail to summarize the notable for image caption generation are emerging as capable solutions to
works in terms of different parameters such as citations, advan­ handle real-life challenges.
tages, limitations, datasets used, feature extraction methods, and A section of works (Singh, Doren, & Bandyo, 2020) Aafaq et al.,
text generation methods along with GitHub repository imple­ 2019, October) (Qi, 2018) (Islam, et al., 2021) (Martin, et al., 2021)
mentation details for future references, for available works. compared the performances of deep learning-based video captioning
(3) Presents vital challenges faced by researchers in the field of visual models, followed by the pros and cons of various evaluation metrics like
captioning, as it is very crucial to identify the challenges while ROGUE, CIDEr, BLEU, SPICE, METEOR, and WMD for captioning of
developing solutions for same. videos. (Singh, Doren, & Bandyo, 2020) categorized all datasets into two
(4) The popular image and video captioning datasets and evaluation main categories namely, open-domain datasets and domain-specific
metrics are discussed and analyzed. It helps give insight into how datasets. From (Singh, Doren, & Bandyo, 2020) and Aafaq et al.,
the evaluation parameters should be selected as the performance (2019, October). it is evident that the work in the field of video
of visual caption generation models greatly depend on the eval­ captioning is fast-paced development because the description of videos
uation parameter chosen. lies in the intersection between two main research areas CV and NLP.
(5) A comparative analysis of the performances of the different state- (Islam, et al., 2021) exhibited the variants of neural networks for visual
of-the-art on benchmark datasets is drawn using different eval­ and Spatiotemporal feature extraction and generation of natural lan­
uation parameters. guage from the extracted features. It revealed that the video captioning
(6) The paper draws attention to the recently evolved branch of problem has a lot to develop in accessing the full potential of deep-
image captioning i.e., Change Image Captioning (CIC). It deals learning for captioning of video frames. (Martin, et al., 2021) catego­
with the translation of changes in the scene as captioning as the rized state-of-the-art techniques for video captioning as task-oriented,
human brain does by considering the time dimensions. techniques oriented, and domain-oriented. The literature reported
(7) The structure of the paper is designed, keeping in view the need only two Li et al., (2019, August). (Amirian, Rsheed, Taha, & Arabnia,
of the budding researchers in the field of visual caption genera­ 2020) surveys that covered both the commodities of visual captioning:
tion by providing the recent state-of-the-art, datasets and their Image captioning and Video Captioning. These works highlighted deep-
performance, and evaluation parameters used in one place. learning architectures of visuals to text generation models.

3
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

Fig. 1. Structured Organization of the Survey; IC: image captioning, VC: Video captioning, SSVC: Single Sentence Video Captioning, DVC: Dense Video Captioning.

This survey has addressed the evolution of visual-to-text generation video captioning evolved. It also provides a chronological overview of
models from traditional to deep learning techniques for both images and the milestone in the field of visual (image/video) captioning. Further,
videos. In addition to this, Dense or Paragraph generation techniques for year-wise statistics of articles published and discussed in the survey from
image and video caption generation are discussed in detail supported by the year 2015–2021 in the field of visual (image/video) captioning, is
an analysis table for each category specifying each method’s strengths shown in Fig. 3, and a chord diagram of these works is shown in Fig. 4.
and demerits. The survey is the first study, to the best of our knowledge, From Fig. 3 it can be inferred that from the year 2013, the number of
that provides strategic evolution of methods discussed in detail along articles published in the area of image and video captioning for deep-
with recent updates in publically available datasets and dataset-wise learning approaches is increasing immensely. Further, Fig. 4 visualizes
comparison analysis of various approaches, in one place. Fig. 2(a) and the inter-relationship between authors and various deep-learning-based
Fig. 2(b) provide a timeline of how different methods for image and image and video captioning tasks. Also, the node of the chord diagram

Fig. 2a. Timeline for different methods for Image Captioning (IC).

4
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

Fig. 2b. Timeline for different methods for Video Captioning (VC).

Fig. 3. Year-wise distribution of articles for deep learning-based visual captioning techniques covered sin the survey paper.

indicates the method and the link represents that one of the works and video captioning. Image captioning generates a single sentence for
served as the baseline in evaluation. Further, the dense connections in each frame whereas when compared to video captioning, a description
the chord diagram give an insight into the contributions of various re­ of a complete video with one sentence is generated. In dense image
searchers and popular state-of-the-art utilized as the baseline, from 2014 captioning, the features from each frame are extracted and represented
to the present. in the form of richer and semantically aware sentences whereas in dense
video captioning each video frame is temporally detected and described
3. Visual Captioning Techniques resulting in a dense description of the whole video using spatial and
semantic details in frames. Recent approaches for image and video
Visual captioning is based on visual understanding and content captioning are discussed in detail in sub-section 3.1 and 3.2.
summarization which have gained popularity in recent years. Visual
content has data which is in the form of image and video that contains a
3.1. Image Captioning Techniques
specific theme or more specifically a scene, an event, etc. The visual
content is converted into captions which are nothing but words or
Image captioning, a popular area of research in the field of Artificial
strings of characters. This conversion of visual information to words Li
Intelligence (AI), deals with mainly two domains for the analysis of
et al., (2019, June). provides a high-level label of objects or activities
images: (i) image understanding (ii) language description. Image un­
which is much easier for understanding. Visual Captioning problem is
derstanding deals with the detection and recognition of objects that help
categorized into image and video captioning which can be further
extract semantic features whereas language description carries out the
classified into a single sentence and dense captioning techniques. The
representation of sentences using both linguistic and semantic under­
example, shown in Fig. 5, helps to depict the difference between image
standing of the language. The challenge is to design an image captioning

5
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

Fig. 4. Chord Diagram represents the comparison among the existing (a) Image Captioning Techniques and (b) Video Captioning Techniques. The node indicates the
method and the link represents that one of the works served as the baseline in evaluation. (DL-ICM (Deep-Learning-Image Captioning Methods), DL-ICT (Deep-
Learning-Image Captioning Task), SSVC-DL (Single Sentence Video Captioning-Deep Learning).

Fig. 5. Illustration of the difference between image captioning, video captioning, dense image captioning, and dense video captioning.

model that can generate more human-like rich descriptions of images Aloimono, 2011) generate descriptions with predefined syntactic rules.
with the understanding of objects or scene recognition in an image and Such methods cannot generate meaningful sentences as they cannot
the relationship among them. The image captioning problem is cate­ express visual content correctly. Nonetheless, the field of image
gorized into two main categories as shown in Fig. 6: (i) Traditional captioning has gained popularity in the recent past owing to the intro­
Techniques (Retrieval-Based and Template-Based methods) and (ii) duction of deep-learning techniques (Cheng, et al., 2017) (Lu et al.,
Deep Learning-Based methods. Retrieval-based techniques (Farhadi, 2017). These techniques utilize encoder-decoder structures to under­
et al., 2010) (Hodosh, Young, & Hockenmaier, 2013) retrieve the closest stand images. Further detailed categorization of deep-learning-based
matching images and generate descriptions as a caption of the query image captioning tasks and methods is shown in Fig. 7.
images. These methods use re-ranking to produce correct sentences but
fail to adjust descriptions for new images. Template-based image 3.1.1. Traditional Image Captioning Techniques
captioning models (Mitchell, et al., 2012) (Yang, Teo, Daume, & Traditional techniques for image captioning can be categorized as

6
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

Fig. 6. Image Captioning Techniques (a) Reterival-Based Caption Model (RCM), (b) Template-Based Caption Model (TCM), (c) Deep-Learning-Based Caption
Model (DLCM).

Retrieval-based and Template-based techniques. Different methods retrieve a set of query images which were further trained for the selec­
related to traditional techniques with their pros and cons are discussed tion of phrases from the ones associated with retrieved images and
in the following sections. finally a description of images is generated based on the selected rele­
vant phrase. (Kuznetsova, Ordonez, Berg, & Choi, 2014) proposed a tree-
3.1.1.1. Retrieval-Based Image Captioning Techniques. The most tradi­ based method similar to (Gupta et al., 2012), which utilized web
tional form of image captioning technique is retrieval-based image captioned images. The disadvantages of retrieval-based image
captioning. It uses a query image and produces a caption for the given captioning methods are evident. These methods generate a description
input image by retrieving a sentence or a set of sentences for a pre- for query images in well-formed human-written sentences. The gener­
specified pool of sentences. The caption is generated either as a single ated descriptions are grammatically correct and fluent which means
sentence or a combination of those retrieved sentences. (Farhadi, et al., they can easily extract semantic information but requires training
2010) established a meaning space < object, action, scene > which links datasets that contain all types of attributes that adapt to new combi­
sentences to images. It created a system that provides rich and subtle nations of objects or novel scenes. Under certain conditions, generated
representations of information by computing a score and linking an descriptions may even be irrelevant to image contents as this method is
image to a sentence. The score is later used to attach the descriptive not good at discovering words outside the training data. Retrieval-based
sentence to a given image. The score closest to the query image is used to methods have large limitations in their capability to describe images.
select the final description of the image as a caption. The work (Ordonez
et al., 2011) employed global image descriptors to retrieve a set of im­ 3.1.1.2. Template-Based Image Captioning Techniques. Template-Based
ages from a web-scale collection of captioned photographs and utilized caption generation technique generates captions both syntactically and
details of the retrieved images to perform re-ranking according to the semantically via a very constrained process. For a given image, this
similarity in contents of the query image. (Hodosh, Young, & Hock­ technique first detects a set of visual concepts (objects, attributes, in­
enmaier, 2013) introduced a dataset for sentence-based image de­ formation from images) and uses a specified grammar rule which com­
scriptions and evaluated using a ranking concept. It assumed that for a bines the information and describes the images by filling the obtained
given query image there always exists a sentence that is appropriate for data or information into the pre-defined blanks of sentence template.
it (Hodosh, Young, & Hockenmaier, 2013). This assumption is not al­ (Yang, Teo, Daume, & Aloimono, 2011) proposed a method where
ways true. Therefore, instead of using retrieved sentences as descriptions nouns, verbs, scenes, and prepositions (known as quadruplets) were
of query images directly in the other line of retrieval-based research, used to describe a sentence template. Description of images was done by
retrieved sentences are utilized to compose a new description for a query using a detection algorithm (Felzenszwalb, Girshick, McAllester, &
image. To project image and text items into a common space, Canonical Ramanan, 2010) Oliva and Torralba, (2001, May). that provided an
Correlation Technique (Bach & Jordan, 2002) (Hardoon, Szedmak, & estimate of objects and scenes in the image and further, the method
Shawe-Taylor, 2004) correlated different captions generated for each employed a language model Dunning, (1993, March). to predict words,
training sample. Further, it measured cosine similarity to determine the scenes, and prepositions for the formation of captions of the input image.
similarity in documents for text analysis in new common space and se­ Such techniques use a triplet of a scene of an object or an action that fills
lects the top-ranked sentences which act as a description for a query the gaps of templates in the format.
image. (Gupta et al., 2012) proposed a method for the generation of a << adj1, obj1 >, prep, < adj2, obj2 >> for encoding recognition. (Li
description for a particular query image. It extracted global features to et al., 2011) proposed an algorithm that first used an image recognizer

7
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

Fig. 7. Taxonomy of Classification of Image Captioning Techniques.

the purpose of which was to obtain visual information from the image, generated sentences due to the availability of a small number of visual
including objects, their attributes, and the spatial relationships between words. Moreover, compared to human-written captions, using rigid
different objects. Furthermore, Conditional Random Field (CRF) helped templates as the main structure of sentences, generated descriptions are
to render the image contents and generates image description as a tree- less natural.
generating process based on visual recognition results and represented
images by using < objects, actions, spatial relationships > triplets 3.1.2. Deep-Learning based Image Captioning Techniques
(Mitchell, et al., 2012) Kulkarni et al., (2013, June).. Fig. 8 presents the Retrieval-based and template-based captioning of images were
system flow for an example image that generates sentences based on adopted mainly in the early work. With recent advancements in deep
labeling. In Fig. 8(d), the nodes of the graph represented objects, their neural networks, researchers are embracing deep-learning-based image
attributes, and the spatial relationship among them. These are used to captioning techniques. Deep neural networks are extensively embraced
fill gaps in the templates and thus are used to complete the template and for tackling the image captioning task. Therefore, the classification of
hence describe the image. deep neural network-based image captioning is outlined based on the
Methods discussed above use visual models which predicate indi­ two subcategories namely: i) image captioning tasks, and ii) image
vidual words from a query image in a piece-wise manner. To generate captioning methods. The classification for different tasks and methods is
more descriptive sentences under template-based learning, phrases are shown in Fig. 7. This section further presents a detailed review of each
used in the generation of sentences (Ushiku et al., 2012). Therefore, state-of-the-art and Tables 1 to 9 demonstrate an overview of these
many methods have been proposed utilizing phrases under template- techniques.
based image captioning. The captions generated by template-based
image captioning methods are syntactically correct, and the de­ 3.1.2.1. Image Captioning Task (ICT). Image descriptions, generated by
scriptions yielded by such methods are usually more relevant to image the captioning model, are in the form of either a single sentence or
contents than retrieval-based ones. However, there are some disadvan­ paragraph, a novel object description, or a description enhanced by the
tages to these methods also. Since template-based description genera­ style or sentiment so generated, or the caption generated consequently
tions are strictly constrained to image contents recognized by visual to which there is a change in the images of a particular scene before and
models, there is a limitation to coverage, creativity, and complexity of incorporation of a captioning task. In view of the above, the deep-

8
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

Fig. 8. System flow for Template-based Image captioning: (a) object and stuff detectors find candidate objects, (b) each candidate region is processed by a set of
attribute classifiers, (c) each pair of candidate regions is processed by prepositional relationship functions, (d) A CRF is constructed that incorporates the unary image
potentials computed by 1–3, and higher-order text-based potentials computed from large document corpora, (e) A labeling of the graph is predicted, (f) Sentences are
generated based on the labeling. (Kulkarni, et al., 2013).

learning-based image captioning techniques are classified into various images. This model described novel objects and their interactions with
Image Captioning Tasks (ICT) such as (i) Single Sentence ICT (ii) Novel- other objects. (Yao et al., 2017) described a copying mechanism known
Object-Based ICT (iii) Stylized ICT (iv) Dense or Paragraph-based ICT as LSTM-C for caption generation of novel objects. In this method, a
and (v) Change ICT. The detailed overview of various image captioning classifier was developed for novel objects by using a separate dataset for
tasks with their state-of-the-art techniques is covered in sub-sections object recognition. It integrated appropriate words for output captions
3.1.2.1.1 to 3.1.2.1.5. using an RNN decoder with a coping mechanism. With the increase in
3.1.2.1.1. Single Sentence Image Captioning Task. An automated complexity of the caption generation for novel objects, a Novel-Object
Single Sentence IC encapsulates the contents of the whole image in a Captioner (NOC) (Venugopalan et al., 2016b) has been introduced to
single sentence. The caption generated may be the outcome of any type generate captions for unseen objects in images. It learned semantic
of deep-learning-based captioning method like encoder-decoder-based knowledge and various external sources to recognize various unseen
methods, semantic concept-based methods, attention mechanism- objects. This model exploited semantic information to generate captions
based methods, etc. Further, a detailed discussion of single sentence IC in the ImageNet dataset for hundreds of object categories that are not
task is provided in sub-section 3.1.2.2. observed in MSCOCO. (Wu et al., 2018) introduced the concept of zero-
3.1.2.1.2. Novel-Object-Based Image Captioning Task. Recent deep- shot novel object caption generation using Decoupled Novel Object
learning techniques for image captioning tasks have achieved favour­ Captioner (DNOC). It generated novel object descriptions without extra
able results, but these techniques depend mainly on paired image training sentences. The zero-shot learning technique bridged the gap
caption datasets. These methods generate captions for objects within the between visual and textual semantics. (Li et al., 2019) discussed a new
context. The novel-object-based image captioning is capable of pointing mechanism-based framework, which is also known as LSTM-P
describing novel objects which are not present in paired image caption or LSTM with a pointing mechanism that facilitated vocabulary expan­
datasets. This technique is based on the following three steps: (1) un­ sion and encouraged global coverage of objects in the sentences gener­
paired data of image and text are trained by a separate language model ated. LSTM-P provided superior results on COCO and ImageNet datasets
and a classifier. (2) a deep caption generation model is usually trained when compared with other state-of-the-art. (Venugopalan et al., 2016b)
on data (paired image). (3) the models trained in (1) and (2) are com­ (Yao et al., 2017). (Agrawal, et al., 2019) encouraged the development
bined and trained which generated the descriptions for the novel ob­ of captioning models that could learn visual concepts from other object
jects. Novel object-based captioning of images is not only trained for detection datasets. However, when these models have been applied in
image-text paired sets but is also trained for unpaired ones. The work the wild a much larger variety of visual concepts are to be learned.
(Hendricks, et al., 2016) presented a Deep Compositional Captioner (Feng, et al., 2019) presented a novel network structure known as
(DCC) which generated the captions for unseen objects present in the Cascaded Revision Network (CRN) that described an image using

9
D. Sharma et al.
Table 1
Overview of Single Sentence IC for Multimodal Learning-based Caption Generation Methods.
Ref. Citations Method Dataset Image Text Pros Cons GitHub Link
Encoder Generation

(Kiros, 672 Multimodal IAPR TC-12 AlexNet LBL The first step The problem of https://github.com/ryankiros/multimodal-neural-language-models
Salakhutdinov, & Learning towards multimodaltext retrieval with
Zemel, learning provides extraneous
Multimodal an improvement in descriptions that
Neural Language BLEU scores. do not exist in the
Models, 2014a) image
(Kiros, CNN-LSTM Flickr8K AlexNet LSTM Provides explicit Complexity https://github.com/linxd5/VSE_Pytorch
Salakhutdinov, & 1162 Flickr30K VGGNet -SC-NLM embedding between increases because
Zemel, Unifying images and of SC-NLM.
visual-semantic sentences.
embeddings with
multimodal
neural language
models, 2014b)
(Karpathy et al., 802 R-CNN Pascal 1 K AlexNet DTR Improves the Does not –
2014)(Karpathy, Flickr8K performance of the incorporate
Joulin, & Li, Deep Flickr30K image sentence spatial reasoning
fragment retrieval task. and is not a better
embeddings for sentence
bidirectional fragment
image sentence representation.
mapping, 2014)
(Mao, et al., 2015) m-RNN Flickr8K AlexNet RNN The model m-RNN with https://github.com/mjhucla/mRNN-CR
Flickr30K VGGNet incorporates more AlexNet requires
10

1131 IAPR TC-12 complex image modifications.


MSCOCO representations
with more
sophisticated
language models.
(Chen & Zitnick, LSTM + Bi- Pascal 1 K VGGNet RNN This model is Small datasets –
2015) 522 RNN Flickr8K capable of learning lead to overfitting
Flickr30K long-term
MSCOCO interactions.
(Cheng, et al., CNN + RNN Flickr30K VGGNet RNN More consistent Complex analysis. –
2017) 14 MSCOCO with expressing the
process of humans
with the generation
of a good

Expert Systems With Applications 221 (2023) 119773


description of
images.
(Liu, Sun, Wang, 55 CNN + RNN MSCOCO – – The effectiveness of – https://github.com/gujiuxiang/CV-NLP_Practice.
Wang, & Yuille, the model is PyTorch/blob/master/Notes/20170218_MAT_A_Multimodal_Attentive_Translator_for_Image_Captioning.
2017) measured md
quantitatively and
qualitatively which
provides the state-
of-art result
(Zhao, Chang, & 18 CNN-RNN Flickr8K, Inception- RNN The model uses Reduces –
Guo, 2019) (GRU, Flickr30K, V3 image attribute captioning in real
LSTM) MSCOCO information to scenes, and
enhance the image cannot cover the
representation. Can rich underlying
semantics
(continued on next page)
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

existing vocabulary from in-domain knowledge. With lesser out-of-


domain knowledge, generated captions may contain ambiguous words
for images with novel objects. Re-edit is done for primary captioning
sentences by a series of cascaded operations after which external
knowledge is utilized which selects more accurate words for the novel
objects and hence generated accurate captions for the unseen or novel
objects. Caption generation for novel objects is a highly desirable yet
challenging task. Therefore, (Hu, et al., 2021) described VIsual VOcab­
ulary pretraining (VIVO) that trained a multilayer transformer model to
generate fluent captions for novel objects along with the locations of
these objects. (Cao, et al., 2020) proposed FDM-net that bridged the gap
between expected visual information and generated visual information
to tackle the novel object captioning problem. Another work, (Wu et al.,
2023) proposed switchable LSTM that incorporated knowledge from
object memory and generate sentences that outperforms the state-of-the-
art methods that used additional language data.
Novel-object ICT has defined the trend to describe an image with
new objects that are not even present in caption corpora to provide
richer information. Further, novel-object ICT provides better object
detection (Agrawal, et al., 2019), (Feng, et al., 2019) by efficiently
describing images with unseen objects. This task needs to leverage a
large amount of data which will enhance the visual vocabulary to bridge
the gap between the ground truth and generated captions (Hu, et al.,
2021), (Wu et al., 2023). Though these methods provide state-of-the-art
results, there exists some grammatical errors in the generated de­
scriptions (Hendricks, et al., 2016), (Venugopalan et al., 2016b).
GitHub Link

Therefore, grammatical or syntactical error-free description of images


for unseen objects is still a challenging task to achieve.
3.1.2.1.3. Stylized Image Captioning Task. Based on the content of
the images, existing models (Kalchbrenner & Blunsom, 2013) (Cho et al.,

2014) (Venugopalan et al., 2016b) generated descriptions of attributes


generated is not
In certain cases,

using factual descriptions without considering the stylized part of the


description

image from other patterns. Stylized captions are considered to be more


the image

accurate.

expressive and attractive in comparison to the flat description of the


Cons

generated images. Fig. 9 shows the block diagram representation of the


stylized image captioning task. It mines information from images using a


Visual attention and

improve the model.


the sg-LSTM model

descriptions which

those provided by

CNN-based encoder. A text corpus is prepared that extracts various


be used for dense

meaningful than
generates more

stylized concepts such as romance, sentiments, etc. From the informa­


mechanisms
multimodal
Flickr users

tion generated from the CNN-based encoder and generated corpus, the
captioning

are more

coverage
accurate

language generation block generates attractive captions. This technique


Pros

has become very popular as this is mainly used in many real-time ap­
plications. For example, people nowadays upload many photographs on
Generation

many social media platforms, which need attractive and stylized cap­
tions for them. (Gan et al., 2017) proposed a novel image captioning
Inception- LSTM

RNN
Text

technique known as StyleNet that generated attractive captions with the


addition of various styles. The factual and style factors are separate from
Encoder

VGGNet

the captions generated with the use of CNN and a factored LSTM ar­
Image

chitecture and outperform the existing approaches with the Flickr­


V3

Style10K dataset which contains 10 K Flickr images with humorous and


BBC News
FlickrNYC

romantic captions. Attractive and stylized captions can be generated


DailyMail
MSCOCO
Dataset

with the use of multitasking sequence-to-sequence training by identi­


Test

fying style factors. In our day-to-day conversations, decision-making,


and interpersonal relationships, various nonfactual expressions such as
Multimodal

Multimodal

shame, pride, etc. are used. (Mathews et al., 2016) described a method
Attention
+ Visual
Citations Method

known as SentiCap which can generate image captions using positive


LSTM

and negative sentiments. This method combined two CNN + RNNs


running in parallel in which one is responsible for the generation of non-
factual words while the other generated the words with sentiments. This
technique produces emotional image captions using only 2000 +
11

training sentences which consisted of sentiments and produced 86.4%


Table 1 (continued )

positive captions. Stylized image captioning has gained popularity in the


(Xian & Tian, 2019)

past few years and therefore advancements are being made in this
(Chen & Zhuge,

technique through multi-style image captioning. (Guo, Liu, Yao, Li, &
Lu, 2019) claimed multi-style image captioning, known as MSCap with a
2020)

standard factual image caption dataset and a multi-stylized language


Ref.

corpus with unpaired images. This model contains four modules namely

11
D. Sharma et al.
Table 2
Overview of Single Sentence IC for Encoder-Decoder-Based Methods Without Attention.
Ref. Citations Method Dataset Image Text Advantages Limitations GitHub Link
Encoder Generation

(Mao, et al., 2016) 528 CNN-LSTM MSCOCO VGGNet LSTM The model task allows for A particular kind of object is https://github.com/mjhucla/Google_Refexp_toolbox
easy objective evaluation too small to detect. Lacks
enough training data.
(Jia, Gavves, Fernando, 374 LSTM + g- Flickr30K VGGNet LSTM – – https://github.
& Tuytelaars, 2015) LSTM MSCOCO com/MITESHPUTHRANNEU/Image-Caption-Generator
Flickr8K
(Vinyals, Toshev, 5094 LSTM (CNN- Flickr30K GoogleNet LSTM Robust model Improvement in https://github.com/nikhilmaram/Show_and_Tell
Bengio, & Erhan, RNN) MSCOCO descriptions can be obtained
2015) Pascal with the use of unsupervised
data
(Pu, et al., 2016) 558 CNN + DGDN Flickr8K, VGGNet LSTM Model is learned using Complex model due to https://github.com/shivakanthsujit/VAE-PyTorch
Flickr30K GRU variational auto-encoder DGDN
MSCOCO with semi-supervised
learning
(Wu, Shen, Liu, Dick, & 403 CNN-RNN Flickr8K, VGGNet LSTM Enhancements in evaluation There is a big gap between https://github.com/liuqihan/Image-Caption
Hengel, 2016) Flickr30K metrics can be observed. this model and human
MSCOCO performance, low accuracy
(Wang, Yang, Bartz, & 207 LSTM + bi- Flickr8K, AlexNet, LSTM The effectiveness, generality, A better result can be https://github.com/deepsemantic/image_captioning
Meinel, 2016) LSTM Flickr30K VGGNet and robustness of proposed obtained with the attention
MSCOCO models were evaluated on mechanism
numerous datasets
(Donahue, et al., 2015) 5505 CNN-LSTM + Flickr30K VGGNet RNN Learning sequential Complex Model https://github.
LRCN MSCOCO CaffeNet LSTM dynamics with a deep com/ekinakyurek/Long-Term-Recurrent-Convolutional-NN
sequence model shows
12

improvement when
compared with other
methods.
(Ren, Wang, Zhang, Lv, 269 CNN-RNN MSCOCO VGG-16 RNN The proposed framework is This model fails to https://github.com/Pranshu258/Deep_Image_Captioning
& Li, 2017) modular w.r.t. the network understand some important
design visual contents that only
take small portions of the
images
(Dai, Fidler, Urtasun, & 383 CGAN Flickr30K VGG-16 LSTM This framework provides an Major errors are the https://github.com/doubledaibo/gancaption_iccv2017
Lin, 2017) MSCOCO evaluator that is more inclusion of incorrect
consistent with human details. e.g., colors (red/
evaluation. yellow hat), and counts
(three/four people)
(Rennie, Marcheret, 1161 SCST MSCOCO ResNet LSTM Provides an improvement in – https://github.com/ruotianluo/self-critical.pytorch

Expert Systems With Applications 221 (2023) 119773


Mroueh, Ross, & Goel, the CIDEr score
2017)
(Yao, Pan, Li, & Mei, 422 GCN-LSTM MSCOCO ResNet LSTM Explores visual relationships The complexity of the model –
Exploring Visual which enhance the image increases for the
Relationship for image captioning by increasing the determination of spatial
captioning, 2018) CIDEr-D score. object relationship
(Chen, et al., 2019) 43 CNN + RNN MSCOCO ResNet LSTM Provides optimization in Generation of duplicate https://github.com/beckhamchen/ImageCaptionGAN
with GAN evaluation metrics like words or phrases is an issue
CIDEr, BLEU, and SPICE. with this model
(Qiu, et al., 2021) – Egocentric EgoShots ResNet Transformer Provides minute details from – https://github.com/NataliaDiaz/Egoshots
Image the images
Captioning
(Mishra, Dhir, Saha, – CNN + RNN MSCOCO ResNet, LSTM, GRU Outperforms other methods Errors are there in the –
Bhattacharyya, & InceptionV4 of captioning from English to recognition of objects in
Singh, 2021) Hindi. images.
(continued on next page)
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

a style-dependent caption generator, a caption discriminator, and a style


classifier, and lastly, experiments are conducted which demonstrate the
outstanding performance of the work (Guo, Liu, Yao, Li, & Lu, 2019).

https://github.com/dabasajay/Image-Caption-Generator
https://github.com/yahoo/object_relation_transformer
(Chen et al., 2018b) proposed a new variant of LSTM named style-
factual LSTM which was used for the generation of captions that had a
specific style. Without the use of extra ground-truth supervision, the
method proposed by (Chen et al., 2018b) outperformed the various
state-of-the-art approaches by using factual and stylized knowledge.
Stylized-based captioning suffers from limited style variation and con­
tent regression. (Chen et al., 2018a) described a controllable stylish
image description model which generated various stylist captions by
plugging in style-specific parameters. (Zhao et al., 2020) proposed a
method MemCap which explicitly encoded the knowledge about lin­
guistic styles with memory mechanisms. MemCap first extracts content-
GitHub Link

relevant style knowledge from the memory module with an attention


mechanism and further, the extracted knowledge was incorporated into
a language model. The effectiveness of (Zhao et al., 2020) is demon­
strated by StyleNet and FlickerStyle10K datasets. To increase the di­

versity of captions (Heidari et al., 2020) presented a framework named


categories objects, relations,

Mixture of Recurrent Experts (MoRE) that derived SVD from weighting


Small objects are detected
which are grouped into 4

incorrectly in some cases


62 errors were observed

matrices of RNN. This model generated diverse and stylized descriptions


attributes, and syntax

of images also in terms of content accuracy. (Li et al., 2021) described


stylized image captioning with paired stylized data that extracted style
Limitations

phrases from small-scale stylized sentences and graft them to large-scale


factual captions. Another work by (Tan, et al., 2022) proposed an un­
supervised stylized transformer-based image captioning model. It de­

taches style presentations from a large stylized text-only corpus and


attaches the separated style representations to the image content. (Wu,
transformer technique which
Modification in conventional
Generates better descriptive

can encode 2D position and

Zhao, & Luo, 2022) defined reinforcement learning model based on


improvement in terms of
Provides state-of-the-art
results on the MSCOCO

stylized image captioning that learns multiple cooperative neural


modules namely, syntax, concept, and style. The method (Wu, Zhao, &
Provides 10.37%

Luo, 2022) successfully provides an improvement in terms of relevancy,


text for images

size of objects.

CIDEr metric
Advantages

stylishness, and fluency of the generated captions.


As observed, Stylized ICT based approaches (Li et al., 2021), (Tan,
dataset

et al., 2022), (Gan et al., 2017) are the promising way to generate
visually stylized, attractive, holistic, diverse, and accurate captions.
However, it is not yet fully utilized the task of the image captioning field.
Transformer
Generation

There exist few works (Guo, Liu, Yao, Li, & Lu, 2019), (Mathews et al.,
2016) that, more or less, describe the images with inappropriate content
LSTM

LSTM
Text

GRU

in description and/or sentiments or both. This is due to limited available


data for the stylized captioning task. The small-size datasets may lead to
overfitting resulting in variations between ground truth and generated
InceptionV3
Encoder

captions. Hence, we can fully utilize the strengths of Stylized ICT using
ResNet

ResNet

ResNet
Image

large-scale datasets by covering a wide range of diverse human emo­


tions/sentiments.
3.1.2.1.4. Dense or Paragraph Image Captioning Task. The image captioning tasks (ICT) discussed in the above sub-sections 3.1.2.1.1, 3.1.2.1.2, and 3.1.2.1.3 generate one sentence to describe an image. These tasks fail to retain and process fine-grained details for caption generation. Moreover, these tasks cannot cover all the detected attributes in the generated sentence/caption. Hence, the captions generated by single-sentence ICT, novel-object ICT, and stylized ICT are limited by a specific grammar. In contrast, the Dense or Paragraph Image Captioning Task opens up a new dimension of coherent and dense image descriptions. It extracts densely correlated features considering fine-grained details of the entire image that are missed out or could not be handled while generating captions by other ICTs. The procedural steps involved in the dense image captioning task are as follows: (1) region proposals are generated for the different regions of the given image; (2) a CNN is used to obtain the region-based image features; (3) the outputs of Step 2 are used by a language model to generate a caption for every region. Fig. 10 (a) exhibits the basic principle involved in the dense captioning technique. An example of image paragraph generation is shown in Fig. 10 (b), which illustrates that dense captioning generates a description for each region, with each description being independent, while image paragraph captioning generates a paragraph with related

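The three steps just listed can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the pipeline of any cited work: an off-the-shelf torchvision detector stands in for the region-proposal stage, a ResNet provides region features, and `caption_decoder` is an assumed, already-trained language model that is not defined here.

```python
import torch
import torchvision
from torchvision.transforms.functional import resized_crop

# Off-the-shelf detector used only as a region-proposal source; any RPN would do.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
backbone = torchvision.models.resnet50(weights="DEFAULT").eval()
region_cnn = torch.nn.Sequential(*list(backbone.children())[:-1])   # drop the classification head

@torch.no_grad()
def dense_caption(image, caption_decoder, score_thresh=0.7):
    """image: (3, H, W) float tensor in [0, 1]; returns a list of (box, caption) pairs."""
    # Step 1: region proposals for the different regions of the given image.
    detections = detector([image])[0]
    boxes = detections["boxes"][detections["scores"] > score_thresh]

    results = []
    for box in boxes:
        x1, y1, x2, y2 = box.round().int().tolist()
        # Step 2: CNN features for the cropped region.
        region = resized_crop(image, y1, x1, max(y2 - y1, 1), max(x2 - x1, 1), [224, 224])
        feat = region_cnn(region.unsqueeze(0)).flatten(1)            # (1, 2048)
        # Step 3: a language model turns each region feature into its own caption.
        results.append((box.tolist(), caption_decoder(feat)))
    return results
```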

Table 3
Overview of Single Sentence IC for Compositional-Architecture-Based Methods.
Ref. Citations Method Dataset Image Text Advantages Limitations GitHub Link
Encoder Generation

(Fang, et al., 1280 CNN-LSTM MSCOCO AlexNet, MELM Can extract nouns, verbs, BLEU and METEOR –
2016) VGGNet and adjectives from all metric is very low.
regions of the image,
captions generated are
better than human-
generated captions 34% of
the time.
(Tran, et al., 125 ConvNet+ MSCOCO, ResNet MELM Detects a broad range of Instagram images are –
2016) DMSM Adobe- visual concepts and filtered images or
MITFiveK generates rich captions. handcrafted abstract
Instagram pictures which are
images difficult to process.
(Ma & Han, 21 LSTM UIUC Pascal AlexNet LSTM Structural words are – –
2016) Dataset generated which provide a
Flickr8K semantically meaningful
description of images.
(Wang, Song, 39 RNN-LSTM Flickr8K VGGNet LSTM RNN-LSTM provides better Parallel threads lead to https://github.
Yang, & Luo, results than dominated many complexities in com/karpathy/
2016) architectures with the model neuraltalk
improvement in efficiency
(Tan, Feng, & 43 CNN-RNN MSCOCO ResNet GRU Model capture finer The inception score does https://github.
Ordonez, Abstract semantic concepts from the not evaluate the com/uvavision/
2019) Scenes visually descriptive text and correspondence Text2Scene
generates captions for between text and images
complex scenes
(Nikolaus, 14 LSTM MSCOCO ResNet LSTM Produce captions with Model is better at https://github.
Abdou, Embedding + combinations of unseen generalizing to com/
Lamm, Attention objects thereby providing transitive verbs than mitjanikolaus/
Aralikatte, & improvements in intransitive verbs compositional-
Elliott, 2019) generalization performance image-captioning
(Tian & Oh, 3 LSTM + MSCOCO ResNet LSTM The framework is easily – –
2020) Attention expandable to include
additional functional
modules of more
sophisticated designs
(Bugliarello & 1 RNN + MSCOCO LSTM LSTM Shows consistent Model complex multi- https://github.
Elliott, 2021) Transformer improvements especially for task model com/e-bug/syncap
inanimate color-noun
combinations.

sentences.

In this direction, DenseCap (Johnson et al., 2016) was the first attempt to generate dense captions. It jointly addressed the localization and description tasks with the help of the Fully Convolutional Localization Network (FCLN) architecture. The FCLN architecture is composed of a CNN, an RNN, and a novel dense localization layer. The DenseCap (Johnson et al., 2016) network was evaluated on the Visual Genome dataset and provided improvements in speed and accuracy. Another work by (Ren, He, Girshick, & Sun, 2015) used a Region Proposal Network (RPN), trained to generate high-quality region proposals. RPN is a fully convolutional network, merged with Fast R-CNN for detection. (Yang, Tang, Yang, & Li, 2016) presented a unified method that was based on two novel concepts: (1) joint inference, which jointly depends on visual features and predicted captions of regions, and (2) context fusion, which combines context features and visual features to generate rich dense captions for a particular image. The Dense Semantic Embedding Network-LSTM (DSE-LSTM) (Xiao, Wang, Ding, Xiang, & Pan, 2019) preferred to extract dense semantic embeddings and predicted a word for every semantic feature at each step. (Kim et al., 2020) introduced relational captioning that generated multiple captions with the help of relational information between objects in an image. The works (Kim et al., 2019) (Kim et al., 2020) presented a multi-task triple-stream network (MTTSNet) to generate diverse and rich captions on large-scale datasets. Another work, (Shao, Han, Marnerides, & Debattista, 2022), proposed a transformer-based dense image captioner that prioritized more informative regions to generate dense captions by learning the mapping between images and their corresponding dense captions. Recent advancements in dense image captioning are related to precise feature extraction that realizes a complete understanding of an image by localizing and describing multiple salient regions. (Zhang, et al., 2019) defined precise feature extraction (PFE) to provide enhanced dense captions of images. The dense relational captioning method (Kim et al., 2020) generated multiple captions that provide relational information between objects and explicit descriptions for different combinations of objects. This method was beneficial in terms of diversity and amount of information. Dense image captioning models Krishna et al., (2017, May) are relatively independent and generate unrelated captions.

Paragraph generation is based on finding the relationship between objects and co-referencing these objects while generating sentences. (Krause, Johnson, Krishna, & Li, 2016) attempted to generate dense paragraphs in the form of detailed and unified stories by detecting semantic regions in images using a hierarchical neural network. It helped to generate coherent sentences to describe an image diversely, thereby increasing the model's complexity. (Wang, Luo, Li, Huang, & Yin, 2018) presented a Depth Aware Attention Model (DAM) that organizes sentences orderly and coherently. The depths of image areas are estimated and spatial relationships between objects are revealed by a linguistic decoder. In this work, the effectiveness of captions is reduced due to the lack of diversity between sentences. Whereas, (Kyriazi et al., 2018) considered sequence-level training and produced diverse paragraph descriptions with an integrated penalty on trigram repetition. Zha et al., (2022, October) managed to generate longer, richer, and more fine-grained descriptions of an image in a paragraph by presenting a Context-Aware Visual Policy (CAVP) network.


Table 4
Overview of Single Sentence IC for Encoder-Decoder Attention-Mechanism-Based Methods.
Ref. Citations Method Dataset Image Text Advantages Limitations GitHub Link
Encoder Generation

(You, Jin, 1342 CNN-RNN MSCOCO GoogleNet RNN A combination of top- An incorrect visual https://github.com/
Wang, Fang, (Semantic Flickr30K down and bottom-up attribute may disrupt chapternewscu/image-
& Luo, Attention) strategies extracts rich the model to attend to captioning-with-
2016) information from images incorrect concepts. semantic-attention
(Park, Kim, & 126 CMSN Instagram ResNet LSTM First personalized image Absolute metric values https://github.com/
Kim, 2017) Dataset captioning approach with for the Instagram cesc-park/attend2u
hashtag prediction and caption are low.
post generation
(Tavakoli, 60 Deep CNN + MSCOCO VGGNet LSTM – – https://github.com/
Shetty, LSTM HemanthTejaY/Deep-
Borji, & Learning-Image-
Laaksonen, Captioning—A-
2017) comparitive-study

(Chen L., 1043 SCA-CNN MSCOCO VGGNet, LSTM Provides improvements in Improvements in https://github.com/
et al., 2017) Flickr8K ResNet the description of images results can be seen zjuchenlong/sca-cnn.
Flickr30K with the temporal cvpr17
attention mechanism
(Liu, MAo, 191 CNN-LSTM Flickr30K VGGNet LSTM Attention maps provide a The quantitative –
Sha, & MSCOCO positive correlation results show that there
Yuille, between attention is room for
2017) correctness and improvement to
captioning quality. improve the
captioning
performance.
(Lu, Xiong, 998 LSTM + MSCOCO ResNet LSTM This model provides a Models give poor https://github.com/
Parikh, & Spatial Flickr30K fallback option for the results for smaller jiasenlu/
Socher, Attention decoder which makes this objects like “surf- AdaptiveAttention
2017) model be used in many board”, “clock” etc.
other applications
excluding image
captioning
(Pedersoli, 157 CNN-RNN MSCOCO VGGNet RNN This model is the first step Background elements https://github.com/
Lucas, towards weakly are missing in some marcopede/
Schmid, & supervised learning captions AreasOfAttention
Verbeek,
2017)
(Anderson, 2370 R-CNN MSCOCO ResNet LSTM This method enables Feature binding https://github.com/
et al., 2018) Soft Attention attention to be calculated problem arises which peteanderson80/
more naturally at the level can be further resolved bottom-up-attention
of objects and other using attention
salient regions,
(Huang, 250 AoANet MSCOCO ResNet LSTM Experiments are Complexity increases https://github.com/
Wang, conducted on MSCOCO with two attention on husthuaan/AoANet
Chen, & which demonstrates this attention.
Wei, 2019) model is superior and
effective when compared
with human evaluation
(Liu M., Li, 15 CNN-RNN + AIC-ICC The model is effective and – –
Hu, Guan, & FCN feasible in image caption
Tian, 2020) generation and the model
is merged with FCN.
(Deng, Jiang, 6 LSTM Flickr30K DenseNet LSTM Significant improvement – https://github.com/
Lan, Huang, Adaptive MSCOCO can be observed in BLEU s1879281/Image-
& Luo, Attention and METEOR scores Captioning-with-
2020) which improves the Adaptive-Attention
quality of image
captioning
(Yan, et al., 13 Adaptive MSCOCO R-CNN Transformer This model is useful for – –
2021) Attention + the generation of non-
Vanilla visual words.
Transformer
(Zhang, Wu, 4 Parallel MSCOCO ResNet LSTM The model can generate This model can cause –
Wang, & Attention high-quality captions by an inaccurate
Chen, 2021) Mechanism capturing related visual description of image
relationships for captioning.
generating accurate
interaction descriptions
(Chen, Ding, Lin, Zhao, & Han, 2018) 57 – MSCOCO ResNet LSTM Incorporated attribute information and their corresponding context features into the decoder for the generation of sentences – –
(Xiao, Xue, 4 Att-LSTM Flickr30K ResNet LSTM Provides state-of-the-art Does not work for all –
Shen, & MSCOCO results on both datasets. images (ignores small
Gao, 2022) objects in some cases)

(Liu, et al., 1 CIIC MSCOCO Faster R- Transformer The proposed model – https: //github.com/
2022) CNN disentangles the visual CUMTGG/CIIC
features and facilitates the
deconfounding of image
captioning
(Zeng, Zhang, 5 S2 MSCOCO Faster R- Transformer The proposed model can –
Song, & Transformer CNN generate more accurate
Gao, 2022) and diverse descriptions
compared to the basic
transformer model.
(Jia, Wang, – SAET MSCOCO Faster R- Transformer The proposed method Increasing the number –
Peng, & CNN provides a semantic of decoder layers
Chen, 2022) association between results in no significant
objects with weak improvement
appearance features
(Lu, et al., 23 DenseNet- AI DenseNet Bi-LSTM The model can effectively – –
2021) BiLSTM Challenger improve the description
effect for image detail
information
(Liu, Hu, Li, 90 NICVATP2L AIC-ICC Inception- LSTM The defined model When there are –
Yu, & Guan, V3 produces more multiple attributes
2020) satisfactory Chinese within an entity, only
captions for the given one attribute is usually
image. captured in the
description.

(Li, Liang, Shi, Feng, & Wang, 2020) also attempted a method for coherent paragraph generation using CNN. The dual-CNN decoder produced a semantically coherent description of images and provided state-of-the-art results on the Stanford Image Paragraph dataset. (Cornia, Stefanini, Baraldi, & Cucchiara, 2020) presented a transformer-based method known as the meshed memory transformer for caption generation. The method devised mesh-like connectivity at the decoder to exploit low-level and high-level features. (Guo, Lu, Chen, & Zeng, 2021) presented a hierarchical topic-guided image paragraph generation framework, also known as the Visual-Textual Coupling Model (VTCM), that coupled a visual extractor with a deep topic model. The LSTM and transformer are jointly optimized to guide the generation of paragraphs and provided results improved over (Cornia, Stefanini, Baraldi, & Cucchiara, 2020) on the Stanford Image Paragraph dataset. The work (Ilinykh & Dobnik, 2020) described visual scenes with long sentences; it included perceptual and semantic information and described what is in the image. (Yang et al., 2020) presented the Hierarchical Scene Graph Encoder-Decoder (HSGED) that generated coherent and distinctive paragraphs and achieved state-of-the-art results on the Stanford Image Paragraph dataset. An image paragraph captioning method should generate consistent sentences rather than contradictory ones. To overcome this, (Yang et al., 2021) presented a method that incorporated objects' spatial coherence into a language-generating model. This method achieved promising results by extracting effective object features for image paragraph captioning. Also, to generate more coherent paragraphs, (Shi, et al., 2021) presented a tree-structured visual paragraph decoder network. Experimental results on the Stanford Paragraph dataset prove the efficiency and efficacy of the tree-structured decoder. Dense-or-paragraph-based ICT generates efficient and dense descriptions of images, including object details and small detections. This task provides promising performance in the generation of coherent and diverse paragraphs. With the advancements made in deep-learning techniques, this task captures the correlation between images and text at multiple levels of abstraction. Though dense captioning supports end-to-end training, these models fail to describe interactions between the objects. Further, there still exists room for improvement by considering richer visual encoding and incorporating different styles of human emotional expressions in dense/paragraph descriptions of images. In addition to this, the dense/paragraph-based ICT suffers from visual ambiguity, geometric ambiguity, and illumination, which degrade the quality of the generated captions.

3.1.2.1.5. Change Image Captioning (CIC) Task. The captioning tasks discussed in sub-sections 3.1.2.1.1 and 3.1.2.1.4 deal with the detection and recognition of objects that help extract semantic features and describe these features in the form of natural language. In this section, we elucidate the Change Image Captioning (CIC) Task, which identifies a change and describes it in an incisive manner. CIC differentiates two images of a changing scene, captured at different time steps, before and after the change. This helps to observe possible analytical changes that occurred in the scene. This view of image caption generation keeps an active and attentive eye on the attributes and the change in attributes over time steps in an image. The process requires an additional reference image to identify the change over time and generate the captions adaptively. Hence, we can say that the CIC method utilizes the time dimension along with the spatial dimension of an image. It is highly useful in real-time applications such as aerial imagery analysis Liu et al., (2018, December) (Tian, Cui, & Reinartz, 2014), disaster response systems (Gueguen & Hamid, 2015), monitoring of land cover dynamics (Khan, He, Porikli, & Bennamoun, 2017), and street scene surveillance (Alcantarilla, Stent, Ros, Arroyo, & Gherardi, 2018).

CIC is a very challenging task that aims to describe the subtle difference between two similar images in the form of natural language. Images acquired from different camera perspectives and with different illumination exposure make CIC more challenging. Therefore, change detection in an image becomes an essential step to generate a caption. Earlier methods Liu et al., (2018, December) (Tian, Cui, & Reinartz, 2014) (Khan, He, Porikli, & Bennamoun, 2017) utilized an unsupervised approach to detect changes due to the high cost involved in labeling large ground-truth samples. Further, the semi-supervised approach (Gueguen & Hamid, 2015) relies on hierarchical shape representation.


Table 5
Overview of Single Sentence IC for Encoder-Decoder Semantic Concept-based Image Captioning Methods.
Ref. Citations Method Dataset Image Text Advantages Limitations GitHub Link
Encoder Generation

(Karpathy & Fei-Fei, 4764 CNN + bi- Flickr8K, VGGNet RNN Provides image sentence Generates descriptions at a https://github.
Deep Visual- RNN Flickr30K rankings that provide rich fixed resolution and com/jonkuo/
Semantic MSCOCO descriptions of images. additive bias interactions in Deep-Learning-
Alignments for RNN are less expressive. Image-
Generating Image Captioning
Descriptions, 2015)
(Yao, Pan, Li, Qiu, & 506 LSTM-A MSCOCO Google LSTM Provides improvements in – –
Mei, Boosting Net high-level attribute
image captioning representation of images
with attributes,
2017)
(Xu, et al., 2015) 7942 CNN- Flickr8K AlexNet LSTM Provides state of art results More extensive https://github.
LSTM Flickr30K for BLEU and METEOR for visualization is required. com/yunjey/
MSCOCO all three datasets as this show-attend-
model can attend non- and-tell
objects salient features.

(Wu, Shen, Wang, 255 CNN-RNN Flickr8K VGGNet LSTM This method provides Knowledge-based queries –
Dick, & Hengel, Flickr30K more accuracy and cannot be handled by this
2018) MSCOCO outperforms other state-of- model.
art methods
(Gao, Wang, & Wang, 13 CNN-RNN- MSCOCO VGGNet LSTM Experimental results Difficult to train images –
2018) SVM provide superior and with the scene graph
competitive results as the
scene graph improves the
performance of the model.
(Zhang, et al., 2020) 11 LSTM and MSCOCO VGGNet, RNN EE-LSTM language model There are some negative https://github.
EE-LSTM Flickr8K ResNet generates sentences with examples of the proposed com/surgicaI/
Flickr30K more details and method as some images image-
outperforms LSTM by a have the only object due to captioning
significant margin. which the advantage of EE-
LSTM cannot be fully
exploited.
(Liu & Xu, Adaptive – CNN + MSCOCO – – The model generates a The model has a strong –
Attention-based LSTM + Flickr30K more comprehensive and dependence on the accuracy
High-level Attention smooth natural language of high-level semantic
Semantic description. acquisition.
Introduction for
Image Caption,
2020)
(Shi, Zhou, Qiu, & 12 CGVRG MSCOCO ResNet LSTM Caption-guided visual A complex model that can https://github.
Zhu, 2020) Visual representation graphs provide better results if com/
Genome provide enhancement in applied to several other Gitsamshi/
text and visual features languages, visual modeling WeakVRD-
tasks Captioning

(Tripathi, Nguyen, 1 SG2Caps MSCOCO MotifNet LSTM Generates high-quality More research is needed to https://github.
Guha, Du, & (GCN- Visual captions without using reach human-level accuracy com/kien085/
Nguyen, 2021) LSTM) Genome visual features. and diversity. sg2caps
V-COCO

Another work (Sakurada et al., 2017) detected changes based on dense optical flow to address the difference in viewpoints. The works of (Feng, et al., 2015) and (Huang, et al., 2017) addressed more subtle, fine-grained change detection, where an object may change its appearance over time. To tackle this problem, (Stent, Gherardi, Stenger, & Cipolla, 2016) estimated a dense flow field between images to address viewpoint differences. In this direction, Park et al. (Park et al., 2019) defined a Dual Dynamic Attention Model (DUDA) based on an attention mechanism rather than pixel-level difference or flow. The model distinguished relevant scene changes from illumination/viewpoint variations with the help of a dynamic attention scheme. The DUDA framework, Fig. 11, consists of two main components: (i) Dual Attention and (ii) Dynamic Speaker. Dual Attention processes multiple visual inputs separately and addresses change captioning in the presence of distractors, whereas the Dynamic Speaker predicts attention over the visual features at each time step and obtains the dynamically attended feature. Another work (Shi et al., 2020a) presented the Mirrored Viewpoint-Adapted Matching (M−VAM) encoder to distinguish viewpoint changes from semantic changes. M−VAM self-predicted the difference (changed) region maps by feature-level matching without any explicit supervision. Further, (Tu et al., 2021) worked on a semantic relation-aware difference representation learning network. This network explicitly learned the difference representation in the presence of distractors. (Hosseinzadeh & Wang, 2021) focused on a training scheme that used an auxiliary task to improve the training of the network. This auxiliary network helped in the generation of precise and detailed captions and provided state-of-the-art results on benchmark datasets for change captioning. The Viewpoint-Agnostic change captioning network with Cycle Consistency (VACC) (Kim et al., 2021) method explicitly distinguishes real change from a large amount of clutter and irrelevant changes. (Tu, Li, Yan, Gao, & Yu, 2021) worked upon the Relation-embedded Representation Reconstruction Network (R3Net) and provided state-of-the-art results on the Spot-the-Diff and CLEVR-Change datasets. The existing scene change captioning approaches recognize and generate change captions from single-view images. Those methods have limited ability to deal with camera movement and object occlusion, which is common in real-world settings.


Table 6
Overview of Novel-object based Image Captioning Task.
Ref. Citations Method Dataset Image Text Advantages Limitations GitHub Link
Encoder Generation

(Hendricks, et al., 2016) 259 DCC MSCOCO VGGNet LSTM Describes new objects Some sentences http://www.eecs.
ImageNet which are not present in generated are berkeley.edu/
current caption corpora, grammatically incorrect ~lisa_anne/
provides rich but they do incorporate dcc_project_page.
descriptions of images new words. html

(Yao, Pan, Li, & Mei, 119 LSTM-C MSCOCO VGGNet LSTM Improvement in Not suitable for large- –
Incorporating copying ImageNet performance can be scale image benchmarks
mechanism in image observed YFCC100M
captioning for
learning novel objects,
2017)
(Venugopalan, et al., 146 NOC MSCOCO VGGNet LSTM The model can describe This model fails to https://github.com/
2016) ImageNet many more novel objects describe new objects. vsubhashini/noc
and provide state-of-art
results
(Wu, Zhu, Jiang, & 43 DNOC MSCOCO VGGNet LSTM Evaluation examples The complex structure https://github.com/
Yang, 2018) contain unseen objects of the model Yu-Wu/Decoupled-
and no additional Novel-Object-
sentence data is Captioner
available.
(Li, Yao, Pan, Chao, & 41 LSTM-P MSCOCO VGGNet LSTM Covers more objects in Placements and –
Mei, 2019) ImageNet the generation of moments of copying
captions and thus novel objects in
improves the captioning sentences are not fully
mechanism. Improved yet understood in the
F1 scores literature.
(Agrawal, et al., 2019) 11 nocaps nocaps ResNet LSTM Dataset provides better Analysis shows that https://github.com/
MSCOCO object detection and there is significant room nocaps-org/updown-
improvements in the for improvement in the baseline
captioning mechanism image captioning task.
are obtained.
(Feng, et al., 2019) 22 Cascaded MSCOCO VGGNet LSTM Can efficiently describe METEOR score is lower https://github.com/
Revision ImageNet images with unseen than LSTM-C qy-feng/CRN
Network objects.
(Hu, et al., 2021) 1 VIVO MSCOCO – – Achieved new state of art Needs to leverage a –
nocaps results on nocaps and large amount of vision
surpassed the human data to provide
CIDEr score improvement in visual
vocabulary
(Cao, et al., 2020) 6 FDM-net nocaps ResNet Scene Graph Feature maps are created – –
to bridge the gap
between expected visual
information and
generated information
(Wu, Jiang, & Yang, 3 S-NOC nocaps VGGNet S-LSTM This method – –
Switchable Novel outperforms other state-
Object Captioner, of-art without using any
2023) additional sentence data.

To resolve these issues, (Qiu et al., 2020) presented a framework that described changes from multiple viewpoints (or 3-D vision) in the form of natural language.

Change ICT is a very challenging task and a stepping stone in the field of visual captioning. It generates descriptions of semantic changes from images and has achieved state-of-the-art results. The CIC method utilizes the time dimension along with the spatial dimension of an image and has huge potential applications in aerial imagery analysis, disaster response systems, monitoring of land cover dynamics, and street scene surveillance. Further, to provide more fine-grained difference representation, this task needs a large amount of data with more complex scenes, object models with high diversity, and object models placed at various locations in the scene.

3.1.2.2. Image Captioning Methods. Deep-learning-based image captioning methods are broadly categorized into three main categories, namely: (i) Multimodal-learning-based IC methods, (ii) Encoder-Decoder-based IC methods, and (iii) Compositional Architecture-based IC methods. Multimodal-learning-based image captioning methods deal with the simultaneous learning of image and text in multimodal space. These methods explore visual relationships which enhance image captioning and provide minute details from the image in a single sentence. There still exist some cases that fail to incorporate captioning in real scenes and cannot cover the rich underlying semantics. The encoder-decoder-based methods follow encoder-based image understanding and decoder-based text generation. These methods provide improvements in high-level attribute representation of images. Further, these methods suffer from feature binding problems which may degrade the performance of the generated caption. Whereas, compositional architecture-based methods generate multiple captions at the output and re-rank these captions to ensure high-quality captions. These methods are capable of producing captions that include combinations of unseen objects and provide improvements in generalization performance. Further, there still exists room for improvement in the performance of these methods as they somehow fail to capture small objects in the descriptions generated. Furthermore, different models related to the above-mentioned image captioning methods are discussed in detail in the following sub-sections, highlighting their pros and cons.


Table 7
Overview of Stylized Image Captioning Task.
Ref. Citations Method Dataset Image Text Advantages Limitations GitHub Link
Encoder Generation

(Mathews, Xie, 137 SentiCap MSCOCO GoogLeNet LSTM Able to generate emotional More or less https://github.
& He, 2016) captions for over 90% of the inappropriate in com/lexingxie/
images and are evaluated content description or cmLab/blob/
using crowdsource and sentiments or both for master/content/
automatic evaluations. some images post/senticap.md

(Gan, Gan, He, 197 StyleNet FlickerStyle10K ResNet LSTM Able to learn styles from a Romantic and https://github.
& Gao, 2017) monolingual textual corpus. humorous styles are com/kacky24/
Can generate visually combined but no stylenet
attractive and stylish improvements were
captions there.
(Guo, Liu, Yao, 40 MSCap COCO, SentiCap, ResNet LSTM Better fluency than MSCap rates can be –
Li, & Lu, FlickerStyle10K StyleNet. Captions improved which can
2019) generated are fluent and further improve the
relevant and are correctly efficacy
stylized
(Chen T., et al., 39 Styled MSCOCO VGGNet LSTM Captures both factual and For negative caption –
2018) Factual- FlickerStyle10K stylized information generation, the
LSTM performance is
competitive with
SentiCap
(Chen C.-K., 9 LSTM with MSCOCO ResNet RNN This model progressively Transfer accuracy of –
Pan, Sun, & Domain includes new styles which the source to humor is
Liu, 2018) Layer are more preferred by low and for lyrics style,
Norm human subjects. it is also low
Zhao et. al 14 MemCap SentiCap VGGNet LSTM Generates sentences that – https://github.
(Zhao, Wu, & FlickerStyle8K describe the content of the com/entalent/
Zhang, 2020) image accurately and reflect MemCap
the desired linguistic style
appropriately
(Li, Zhai, Lin, & – SAN SentiCap ResNet LSTM The framework generates Noise may be –
Zhang, 2021) FlickerStyle8K corresponding stylized introduced in each
captions for images in the process
large-scale factual corpus.
(Heidari, 1 MoRE MSCOCO Inception- RNN Provides improvements in – https://github.
Ghatee, V3 terms of accuracy, diversity, com/marzi-
Nickabadi, & and styled captions heidari/styled-
Nezhad, and-diverse-
2020) image-captioning
(Tan, et al., 1 Detach & SentiCap Faster R- Transformer Provides more holistic style – –
2022) Attach FlickerStyle8K CNN supervision to generate
stylized captions
(Wu, Zhao, & – SIC SentiCap ResNet Scene Graph Provides improvements w.r. This method lacks –
Luo, 2022) FlickerStyle8K t. relevancy and stylishness information passing
of generated sentences. between different
neural models

3.1.2.2.1. Multimodal Learning-based Image Captioning Methods. Multimodal learning-based image captioning models learn both image and text jointly in multimodal space. The basic steps of such models include an image encoder, a language encoder, and the projection of the encoded vectors into multimodal space, followed by a language decoder. Projection of the image and text encoded vectors into multimodal space maps the image features into a common space with the word features. It helps these models learn richer discriminant features for each sample and yield improved captions. (Kiros et al., 2014a) were the first to propose a multimodal learning-based image captioning method. This technique has an advantage over traditional ones as it describes images without the use of templates, syntactic trees, and structured prediction. This concept can also be extended to other modalities like audio. (Kiros et al., 2014b) proposed an extension of (Kiros et al., 2014a) which learns a joint image-sentence embedding with the use of an LSTM for sentence encoding. It used a new language model named the Structure-Content Natural Language Model (SC-NLM) for caption generation and reported results superior to those of (Kiros et al., 2014a). (Karpathy et al., 2014) presented a bidirectional retrieval of images and sentences with the use of deep, multimodal embeddings of visual and natural language data. This model worked on a finer level and embedded fragments of objects and sentences to interpret predictions for the image-sentence retrieval task. The Multimodal-RNN (m-RNN) method (Mao, et al., 2015) generated the probability distribution of a word given the previous words and an image. The effectiveness of m-RNN is evaluated on four benchmark datasets, namely IAPR TC-12, Flickr8K, Flickr30K, and MSCOCO. (Chen & Zitnick, 2015) restored visual features from the given description. It can generate sentences and retrieve both images and sentences. This method defined an additional recurrent visual hidden layer with the RNN that made a reverse projection, which provided better results when compared with other state-of-the-art methods. (Cheng, et al., 2017) presented a hierarchical multimodal learning model with an attention mechanism. The model included a CNN network for image encoding, an RNN network for the identification of objects in images in a sequential manner, and a multimodal learning-based RNN with an attention mechanism for caption generation with intermediate semantic objects and global visual contents. (Liu et al., 2017) defined a sequence-to-sequence RNN model. This model extracted the objects' features in an image and arranged them in order using a CNN-based structure that generates the corresponding words in the sentences.

Recent techniques for multimodal-based captioning of images are based on CNN and RNN models which have achieved excellent performance in the captioning of images.
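The joint multimodal space described at the start of this sub-section can be sketched as follows: image and sentence encodings are projected into a shared space and trained with a hinge-based ranking loss. This is a generic illustration under assumed dimensions and layer choices, not the exact formulation of any cited method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, image_dim=2048, vocab_size=10000, word_dim=300, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(image_dim, joint_dim)                 # image encoder output -> joint space
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.sent_rnn = nn.GRU(word_dim, joint_dim, batch_first=True)   # sentence encoder -> joint space

    def forward(self, image_feats, captions):
        # image_feats: (B, image_dim) CNN features; captions: (B, T) token ids
        img = F.normalize(self.img_proj(image_feats), dim=1)
        _, h = self.sent_rnn(self.word_embed(captions))
        txt = F.normalize(h[-1], dim=1)
        return img, txt                                                 # both live in the same joint space

def ranking_loss(img, txt, margin=0.2):
    # Hinge ranking loss: matching image/caption pairs should score higher than mismatched ones.
    scores = img @ txt.t()                                              # cosine similarities, (B, B)
    pos = scores.diag().unsqueeze(1)                                    # similarity of each true pair
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_img = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)      # image as anchor
    cost_txt = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # caption as anchor
    return cost_img.mean() + cost_txt.mean()
```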

Table 8
Overview of Dense or Paragraph Image Captioning Task.
Ref. Citations Method Dataset Image Text Advantages Limitations GitHub Link
Encoder Generation

(Ren, He, Girshick, & Sun, 24,897 RPN PASCAL VOC VGGNet R-CNN Efficient and accurate region Complexity increases with https://github.com/ShaoqingRen/
2015) 2007 proposal framework. Provides Fast R-CNN faster_rcnn
object detection accuracy
(Johnson, Karpathy, & Fei- 1001 CNN-RNN Visual VGGNet LSTM Supports end-to-end training and Failure occurs in some cases https://github.com/jcjohnson/densecap
Fei., Densecap: Fully Genome efficient test-time performance and where interaction between
convolutional localization provides visually pleasing results. objects can be seen.
networks for dense
captioning, 2016)

(Krause, Johnson, Krishna, & 245 HRN MSCOCO VGGNet RNN This model interpretability – https://github.com/chenxinpeng/im2p
Li, 2016) generates descriptive paragraphs
using only a subset of image regions
and with the use of a wider
vocabulary.
(Yang, Tang, Yang, & Li, 97 LSTM Visual VGGNet LSTM This novel model incorporates joint Sequential modeling needs https://github.com/linjieyangsc/densecap
2016) Genome inference and context fusion and enhancements for this
achieves state-of-art performance framework.
on Visual Genome
(Zha, Liu, Zhang, Zhang, & 59 CAVP Stanford VGGNet RNN Superior to RL-based methods and the models optimized by https://github.com/daqingliu/CAVP
Wu, 2022) Paragraph provides top-ranking performances CIDEr or BLEU would be
Dataset on MSCOCO and Stanford image superior over that by cross-
MSCOCO captioning datasets. entropy
(Wang, Luo, Li, Huang, & Yin, 21 CNN-LSTM + Visual – LSTM Strengthen image paragraph Takes a much longer time to
2018) Attention Genome captioning by enriching raw data generate paragraphs with
with extra geometric information diversity.
which also improves diversity
(Kyriazi, Han, & Rush, 2018) 29 SCST + Visual VGGNet LSTM This work increases diversity in Many language issues arise https://github.com/lukemelas/image-
Repetition Genome paragraph generation and provides for paragraph generation. paragraph-captioning
Penalty substantial improvement in state-
of-art techniques.
(Zhang, et al., 2019) – LSTM Visual VGGNet + LSTM Provides better regional features Complexity arises in the
Genome ROIAlign which promote better calculation of mAP (mean
implementation of region Average Precision)
positioning and descriptions.
(Xiao, Wang, Ding, Xiang, & 23 DSE-LSTMS MSCOCO VGGNet LSTM A bidirectional LSTM structure is MSCOCO is more challenging
Pan, 2019) Flickr30K ResNet used which captures previous and than Flickr30K in captioning
future contexts. TReLU can improve and retrieval tasks.
distinctness in the captions



(Cornia, Stefanini, Baraldi, & 173 M2 -Transformer MSCOCO ResNet Transformer Provide object details and small This model is slightly worse https://github.com/aimagelab/meshed-
Cucchiara, 2020) detections which achieve new state- on the ROUGE evaluation memory-transformer
of-art results on COCO. metric.

(Li, Liang, Shi, Feng, & Wang, 6 Dual-CNN Stanford VGGNet Region The model provides more efficiency For the encoder side, more https://github.com/bupt-mmai/CNN-
2020) Paragraph Attention by giving less training time and is semantic information like Caption
Dataset effective with a high CIDEr score positional relationships must
obtained. be considered.
(Kim, Oh, Choi, & Kweon, 2 MTTSNet Visual VGGNet LSTM This model facilitates POS-aware Suffers from visual https://github.com/Dong-JinKim/
2020) Genome relational captioning. This new ambiguity, geometric DenseRelationalCaptioning.
framework can open new ambiguity, and illumination.
applications.
(Guo, Lu, Chen, & Zeng, 2021) – VTCM-Transformer Stanford Paragraph Dataset CNN + RPN LSTM & Transformer This method captures the correlation between image and text at multiple levels of abstraction and learns semantic topics from images. The complexity of the model increases as this model also generates the topic for the paragraph generated. –

(Zhao, Chang, & Guo, 2019) proposed a multimodal fusion method for the description of images, as depicted in Fig. 12. This model generated captions in four parts, i.e., a CNN for the extraction of features, an attribute extraction model, an RNN for the prediction of words, and a CNN-based model for the generation of language. Extensive experiments were conducted on Flickr8K, MSCOCO, and Flickr30K, which prove that the proposed model provides impressive results for the generation of captions for images. Xian and Tian, (2019, May) presented a self-guiding multimodal LSTM (sgl-LSTM) model that handled an uncontrollably imbalanced real-world image-sentence dataset. It is based on multimodal LSTM (m-LSTM), deals with noisy samples, and can fully explore the dataset itself. The model outperformed the traditional RNN-model-based techniques (Chen & Zitnick, 2015) (Cheng, et al., 2017) in describing the key components of the input images. (Chen & Zhuge, 2020) touched upon news image captioning applications. News captioning is different from generic captioning as news images contain more detailed information. To understand and learn news images, a multimodal attention mechanism is defined. It incorporates a multimodal pointer generation network to extract visual information. Experiments on the Dailymail and BBC test datasets provide improvements in results in terms of the BLEU, METEOR, and ROUGE-L evaluation metrics.

3.1.2.2.2. Encoder-Decoder Architecture-Based Image Captioning Methods. Encoder-decoder-based image captioning methods comprise encoder-based image understanding and decoder-based text generation modules. These methods may further adopt one of three designs, namely: (i) Encoder-Decoder Architecture Without Attention Mechanism, (ii) Attention-mechanism-based techniques, and (iii) Semantic Concept-based techniques. A detailed analysis of the pros and cons of different encoder-decoder-based methods is discussed in the following section.

3.1.2.2.3. Encoder-Decoder Architecture without Attention Mechanism. The neural-network-based image captioning methods work on the principle of an end-to-end framework. The Encoder-Decoder architecture (Kalchbrenner & Blunsom, 2013) (Yang et al., 2016) (Cho et al., 2014) was originally designed to translate sentences between different languages. The idea behind adopting this method is to treat the image captioning technique as a translation problem, but with different modalities. Machine translation architectures extract features from the hidden activation of a CNN and pass them to an LSTM to generate a sequence of words. The Encoder-Decoder architecture without attention is depicted in Fig. 13. From this figure, two vital observations can be stated as follows:

i) Relationships between the detected objects and the scenes can be derived by using a CNN model.
ii) Image captions can be generated by feeding the output obtained in (i) to a language model, which converts it into words and combined phrases.

(Kiros et al., 2014a) presented image caption generation word by word. (Vinyals, Toshev, Bengio, & Erhan, 2015) defined the Neural Image Caption (NIC) generator, which is similar to (Kiros et al., 2014a). It is based on maximum likelihood estimation and proposes a CNN and an LSTM for the representation of images and the generation of captions, respectively. This method is susceptible to the vanishing gradient problem because the information related to the image is fed only at the beginning of the process and words are generated based on the previous hidden state and the current time step, which continues until the process reaches the end token of the sentence. As the process continues, the role of the initial words becomes weaker, thereby degrading the quality of the sentences generated. Hence, to overcome the challenges of LSTM for long-length sentence generation, (Jia et al., 2015) presented an extension of LSTM known as guided LSTM or gLSTM which successfully generated long-length sentences (Bahdanau, Cho, & Bengio, 2015) (Cho et al., 2014). While designing, different length normalization strategies were considered to control the length of sentences. Various methods (multimodal embedding space) are being adopted to extract semantic information from the generated captions.

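A minimal sketch of the plain CNN-LSTM encoder-decoder captioner discussed above is given below: the image is encoded once and injected only at the first decoding step, in the spirit of NIC/LRCN. It is an illustrative simplification under assumed dimensions, not a re-implementation of either model.

```python
import torch
import torch.nn as nn
import torchvision

class CNNLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="DEFAULT")
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])   # CNN encoder -> (B, 2048, 1, 1)
        self.img_to_embed = nn.Linear(2048, embed_dim)                  # image injected only at step 0
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) ground-truth tokens (teacher forcing)
        with torch.no_grad():                                           # keep the CNN frozen in this sketch
            feats = self.encoder(images).flatten(1)
        img_token = self.img_to_embed(feats).unsqueeze(1)               # image acts as the first "word"
        inputs = torch.cat([img_token, self.embed(captions)], dim=1)    # (B, T+1, embed_dim)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden[:, :-1, :])                              # logits[:, t] predicts captions[:, t]
```

Because the image enters only once, information about it must survive through the recurrent state, which is exactly the weakness for long sentences noted above.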

Table 9
Overview of Change Image Captioning Task.
Ref. Citations Method Dataset Image Text Advantages Limitations GitHub Link
Encoder Generation

(Park, Darrell, & 43 DUDA CLEVR- ResNet RNN The model is robust to The spot-the-Diff dataset https://github.com/Seth-
Rohrbach, 2019) Change distractors in the sense is not the definitive test Park/
Spot- that it can distinguish for robust change RobustChangeCaptioning
the-diff relevant scene changes captioning as it does not
from illumination/ consider the presence of
viewpoint changes distractors
(Kim, Kim, Lee, Park, – VACC CLEVR- ResNet- LSTM The cycle consistency
& Kim, 2021) DC 101 module that evaluates
CLEVR- the quality of the
Change caption
Spot-
the-diff
(Shi, Yang, Gu, Joty, & 6 M− VAM CLEVR- ResNet- LSTM M− VAM encoder can Negative samples are
Cai, 2020) Change 101 accurately filter out the seen due to resizing
Spot- viewpoint influence operations; as a result,
the-diff and figure out the objects become too small
semantic changes from to be recognized.
the images
(Hosseinzadeh & 2 CLEVR- ResNet- LSTM This method composed
Wang, 2021) Change 101 query image retrieval
Spot- as an auxiliary task to
the-diff improve the primary
task of image change
captioning
(Tu, et al., 2021) – SRDRL CLEVR- ResNet- LSTM This method achieves This method needs https://github.com/
+ AVL Change 101 state-of-the-art improvement to learn tuyunbin/SRDRL
Spot- performances. more fine-grained
the-diff difference representation
(Qiu, Satoh, Suzuki, 4 Indoor 3D- ResNet- LSTM First attempt for indoor There is still room for –
Iwata, & Kataoka, CIC Dataset 101 scene change improvement, especially
Indoor Scene captioning. in object attribute
Change Captioning understanding
Based on
Multimodality Data,
2020b)
(Qiu, Satoh, Suzuki, 7 3D-CIC 3D- ResNet LSTM Three syntactic The dataset should –
Iwata, & Kataoka, Dataset datasets are created. contain more complex
3D-Aware Scene scenes, object models
Change Captioning with higher diversity,
From Multiview and placing object
Images, 2020a) models at various
locations in the scenes.
(Tu, Li, Yan, Gao, & – R3: Net CLEVR- ResNet- LSTM This method can This model does not https://github.com/
Yu, 2021) Change 101 explicitly distinguish consider very slight tuyunbin/R3Net
Spot- semantic changes from movements as the
the-diff viewpoint changes decoder receives the
weak information of
change

Fig. 9. Basic Block Diagram representation of Stylised Image Captioning.


Fig. 10. (a) Fundamental Steps Involved in Dense Image Captioning (b) An example to illustrate Dense Captioning & Paragraph Generation.

Fig. 11. Architecture of DUDA (Park, Darrell, & Rohrbach, 2019).

(Donahue, et al., 2015) designed another variation of the encoder-decoder architecture that stacks multiple LSTMs, also known as the Long-term Recurrent Convolutional Network (LRCN), which generates captions for static as well as dynamic images. In (Mao, et al., 2016), a special type of text generation method was defined to generate a specific object or region description, known as a referring expression, using a unidirectional LSTM. The generated referring expressions help infer ambiguity in object/scene representation. Recent techniques for the detection and classification of objects (Simonyan & Zisserman, 2014) Krizhevsky et al., (2017, June) exhibited that deep hierarchical methods perform better than shallower ones. (Wang, Yang, Bartz, & Meinel, 2016) defined a deeper Bi-directional LSTM framework to generate semantically rich sentences. It utilized both past and future context information, which helped in learning long-term visual-language interactions. (Wu et al., 2016) incorporated high-level semantic concepts into the encoder-decoder architecture through the combination of CNN and RNN in the form of attribute probabilities. This method achieved a significant improvement in the state-of-the-art for both image captioning and visual question answering. (Pu et al., 2016a) opted for a semi-supervised learning method to train a Deep Generative Deconvolutional Network (DGDN) (Pu et al., 2016b) as a decoder and a deep CNN as an encoder. It can model the image in the absence of associated captions, and is thus known as semi-supervised. (Yao et al., 2018) discussed an attention-based encoder-decoder framework that unified a Graph Convolutional Network with LSTM to define the semantic and spatial object relationships in the image encoder. Extensive experiments were carried out on the COCO dataset and better results were obtained, remarkably increasing CIDEr-D from 120.1 to 128.7.

The work (Ren et al., 2017) introduced a novel decision-making framework for image captioning. It utilized a “policy network” and a “value network” to collaboratively generate captions. (Dai, Fidler, Urtasun, & Lin, 2017) presented Conditional Generative Adversarial Networks (CGAN) that aim to improve the naturalness and diversity of generated captions. This model jointly learns a generator to produce descriptions conditioned on images and an evaluator to assess how well a description fits the visual content. (Chen et al., 2019a) presented an extension of the traditional reinforcement-learning-based architecture. It dealt with the inconsistent evaluation problem and distinguished whether generated captions are human-generated or machine-generated with the use of a discriminator (CNN- or RNN-based structures). (Rennie et al., 2017) considered the problem of optimizing image captioning systems using reinforcement learning.


Fig. 12. Illustration of Fusion Approach for Multimodal Image Captioning (Zhao, Chang, & Guo, 2019).

Fig. 13. Basic Principle Involved in Encoder-Decoder Architecture-Based Image Captioning.

They worked on a form of the REINFORCE algorithm and presented Self-Critical Sequence Training (SCST) for image captioning. This method provided an improvement in the CIDEr score from 104.9 to 114.7. Another work, (Chen et al., 2018c), described captioning of images with the use of the REINFORCE algorithm. This method enables the correlation between actions to be learned. Existing encoder-decoder-based methods suffer from two major problems: (i) all the words of a caption are treated equally without considering the importance of different words, and (ii) in the caption generation phase, the semantic objects or scenes might be misrecognized. To overcome these issues, Reference-based LSTM (R-LSTM) (Ding, et al., 2019) was presented to generate more descriptive information with the use of reference information. The work (Herdade et al., 2020) used an object relation transformer with an abstract feature vector to provide a spatial relationship between input-detected objects. This approach exhibited improvements in all captioning metrics for the MSCOCO dataset. (Patwari & Naik, 2021) combined a CNN with an attention-based Gated Recurrent Unit (GRU) to generate a better description for a given image. It is less complex and easy to train for about 100 epochs. Mishra et al., (2021, June) presented a CNN-RNN-based encoder-decoder network for image captioning in the Hindi language by manually translating the popular MSCOCO dataset from English to Hindi. It provided state-of-the-art results with a BLEU-1 score of 62.9, BLEU-2 score of 43.3, BLEU-3 score of 29.1, and BLEU-4 score of 19.0. Image captioning is used in several applications as mentioned earlier. It can also be used to capture eating episodes of a subject and further record rich visual information. (Qiu, et al., 2021) defined a framework for passive dietary intake monitoring, known as egocentric image captioning, which unifies food recognition, volume estimation, and scene understanding. The work (Qiu, et al., 2021) is the first-ever work in the field of egocentric image captioning.

3.1.2.2.4. Encoder-Decoder Architecture with Attention Mechanism. For image captioning, CNN-based encoders (Kiros et al., 2014a) have been used to extract visual features for the image and an RNN-based decoder to convert the extracted visual features into natural language. Encoder-decoder-based methods (Chen et al., 2019a) (Herdade et al., 2020) are unable to analyze the image over time. Also, such methods do not consider the spatial aspects of an image that are relevant for captioning; instead, these methods generate captions for the scene as a whole. To overcome these limitations, attention-mechanism-based captioning of images came into existence. Attention mechanisms are now widely applied and yield significant improvements for various tasks. The essential function of the attention mechanism is to map the text description of the image to different regions of the image. For the attention-based captioning model, ct is the feature map extracted from the regions after the CNN, which is defined by:

ct = g(V, ht)  (1)

where g is the attention mechanism, V = [v1, v2, ..., vk] is the vector representing the image features corresponding to the k regions of the image, and ht is the hidden state of the RNN at time t. The attention distribution over the k regions, bt, is:

b̃t = wh^T tanh(Wv V + (Wg ht) 1^T)  (2)

bt = softmax(b̃t)  (3)

From equations (2) and (3), the final ct is given by:

ct = Σ_{i=1}^{k} bti vi  (4)

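Equations (1)-(4) above translate directly into the following sketch of a soft-attention module; tensor shapes and layer names are assumptions for illustration and are not taken from any cited implementation.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.W_v = nn.Linear(feat_dim, attn_dim, bias=False)    # W_v in Eq. (2)
        self.W_g = nn.Linear(hidden_dim, attn_dim, bias=False)  # W_g in Eq. (2)
        self.w_h = nn.Linear(attn_dim, 1, bias=False)           # w_h in Eq. (2)

    def forward(self, V, h_t):
        # V: (B, k, feat_dim) holds the region features v_1..v_k; h_t: (B, hidden_dim)
        scores = self.w_h(torch.tanh(self.W_v(V) + self.W_g(h_t).unsqueeze(1)))  # Eq. (2): (B, k, 1)
        b_t = torch.softmax(scores, dim=1)                                       # Eq. (3)
        c_t = (b_t * V).sum(dim=1)                                               # Eq. (4): weighted sum of regions
        return c_t, b_t.squeeze(-1)                                              # context c_t and weights b_t
```

The returned context c_t is what the decoder consumes at each time step, so the attention weights shift from region to region as the sentence is generated.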

These methods are becoming increasingly popular in deep learning for image captioning. Various attention mechanism methods discussed in this paper include (1) soft attention (Anderson, et al., 2018), (2) spatial and channel-wise attention (Chen et al., 2017), (3) adaptive attention (Lu et al., 2017) Deng et al., (2020, July) (Sharma, Dhiman, & Kumar, 2022), (4) multi-head and self-attention (Zhang, Wu, Wang, & Chen, 2021), and (5) semantic attention (You, Jin, Wang, Fang, & Luo, 2016). The basic model for the attention-based mechanism is shown in Fig. 14, wherein a CNN is used to extract the information from the input image. The information so extracted is fed to the language generation part, which generates the i-th word or phrase based on this information. The hidden state of the LSTM, hi, is used to select the relevant part of the image. zi, the output of the attention model, is used as an input to the LSTM for the extraction of the salient features of the image focused on at each time step of the language generation model. The generated captions are updated dynamically until the end of the language generation model.

(Xu et al., 2015b) were the first to introduce the concept of attention-based captioning of images. It described the salient contents of an image and generated corresponding words for the salient parts at the same time. The method is based on stochastic hard attention and deterministic soft attention for generating captions. The use of an attention-based mechanism in the captioning of images provided improvement in the BLEU and METEOR metrics. (Jin, Fu, Cui, Sha, & Zhang, 2015) discussed another method in the category of attention-based mechanisms which extracted the flow of abstract meaning based on the semantic relationship between visual and textual information. This model focused on scene-specific content for the extraction of high-level semantic information. The novelty of the model (Jin, Fu, Cui, Sha, & Zhang, 2015) lies in the fact that it introduced multiple visual regions of the image at multiple scales. Extensive experiments were carried out on three benchmark datasets, MSCOCO, Flickr8K, and Flickr30K, which justified the superiority of this technique. () presented a review-based attention mechanism technique. The work performed multiple review steps with attention over various CNN hidden states to define the output vector that provided global facts of the image. For example, a reviewer module can first review what sort of objects are present in the image; then it can review the location or position of the objects, and a subsequent review can extract all the important information of the image. The information obtained can further be passed to a decoder for caption generation. (You, Jin, Wang, Fang, & Luo, 2016) presented a semantic attention-based technique that combined both top-down and bottom-up approaches to selectively extract semantic features, followed by conversion into captions. (Pedersoli, Lucas, Schmid, & Verbeek, 2017) proposed an area-based attention mechanism for caption generation. This method directly associated caption words with image regions and predicted the next word as well as the corresponding image region at each time step. The work (Pedersoli, Lucas, Schmid, & Verbeek, 2017), when combined with a spatial transformer network, produced high-quality image captions for the MSCOCO dataset.

Most attention-mechanism-based methods forced visual attention to be active for each generated word. However, some words in the generated captions, like "a", "the", and "of", do not require any visual attention, and attending to the image for them can affect the generation process and degrade the overall efficiency of the process. (Lu et al., 2017) talked about an adaptive attention model with a visual sentinel. It employed a spatial attention mechanism to extract spatial features, followed by adaptive attention for the visual sentinel. In the work Deng et al., (2020, July), a combination of DenseNet and adaptive attention with a visual sentinel was presented. In this model, DenseNet was used to extract the global features from an image while, at the same time, the sentinel gate was set by an adaptive attention mechanism that decided whether the image feature information should be used for word generation or not. An LSTM network helped in the decoding phase for the generation of captions. (Liu, 2017) presented an idea of neural image captioning which evaluated and corrected the attention map at time steps. This method made a consistent map between image regions and the words generated, which was made possible with the introduction of a quantitative evaluation metric. Experiments carried out on the Flickr30K and MSCOCO datasets showed prominent improvements in both attention correctness and the quality of captions generated. A CNN dubbed SCA-CNN (Chen et al., 2017) considered spatial and channel-wise attention for the computation of the attention map. This method modulates the sentence generation context in multi-layer feature maps and visual attention. The SCA-CNN architecture is evaluated on three benchmark datasets, Flickr8K, Flickr30K, and MSCOCO, and significantly outperformed its counterparts.

Bottom-up saliency-based attention methods (Anderson, et al., 2018) (Tavakoli, Shetty, Borji, & Laaksonen, 2017) are beneficial in reducing the gap between human-generated and machine-generated descriptions. (Tavakoli, Shetty, Borji, & Laaksonen, 2017) worked on a bottom-up saliency-based attention mechanism. This method proved that the better a captioning model performs, the better an attention agreement it has with human descriptions. Also, this method provided better results on unseen data. (Anderson, et al., 2018) proposed a method that used both top-down and bottom-up approaches which can attend to both object-level regions and salient image regions. It provided better results than (Tavakoli, Shetty, Borji, & Laaksonen, 2017) as the bottom-up mechanism in (Anderson, et al., 2018) used Faster R-CNN, which provides better results on the MSCOCO dataset. The Context Sequence Memory Network (CSMN) (Park, Kim, & Kim, 2017) is different from the previously discussed techniques (Anderson, et al., 2018) (Tavakoli, Shetty, Borji, & Laaksonen, 2017). It generated captions for images by extracting context features with the use of "hashtag prediction" and "post generation".

Fig. 14. Basic Structure for Attention-Based Image Captioning.
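For concreteness, a minimal sketch of the soft-attention decoding step of Fig. 14 is given below. This is an illustrative PyTorch-style example written for this survey; the module names, layer sizes, and single-layer LSTM cell are assumptions and do not reproduce the implementation of any cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionDecoderStep(nn.Module):
    """One decoding step of a soft-attention captioner (illustrative sketch)."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)    # project region features
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)   # project previous hidden state h_i
        self.att_score = nn.Linear(hidden_dim, 1)          # scalar relevance score per region
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, prev_word, h, c):
        # regions: (B, R, feat_dim) CNN features of R image regions
        # prev_word: (B,) previous token ids; h, c: (B, hidden_dim) previous LSTM state
        scores = self.att_score(torch.tanh(self.att_feat(regions) +
                                           self.att_hid(h).unsqueeze(1)))   # (B, R, 1)
        alpha = F.softmax(scores, dim=1)                   # attention weights over regions
        z = (alpha * regions).sum(dim=1)                   # context vector z_i, (B, feat_dim)
        h, c = self.lstm(torch.cat([self.embed(prev_word), z], dim=1), (h, c))
        return self.out(h), h, c, alpha.squeeze(-1)        # word logits + updated state
```

At inference, the highest-scoring word (or a beam-search candidate) becomes prev_word for the next step, and the returned weights alpha can be inspected to visualize where the model attended for each generated word.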


Another work by (Chen et al., 2018b) presented an attribute-driven attention model for image captioning which maintained the co-occurrence dependencies among attributes. The Attention on Attention (AoA) framework (Huang, Wang, Chen, & Wei, 2019) extended the conventional attention mechanism to determine the relevance between attention results and queries. AoA is applied to both the encoder and decoder of the image captioning model, called AoANet, which provided a new state-of-the-art performance on the MSCOCO dataset with a CIDEr-D score of 129.8. Liu et al. (2020, March) utilized the concept of dual attention, combining visual and textual attention for image caption generation. (Liu, Hu, Li, Yu, & Guan, 2020) proposed the NICVATP2L model that incorporated visual attention and topic modeling. The visual attention reduced the deviation, and the topic model improved the accuracy and diversity of the generated sentences. Visual and topic features help to generate more natural, informative, and descriptive captions in Chinese. (Zhang, Wu, Wang, & Chen, 2021) explored visual relationships between regions in an implicit way that helped provide alignment between caption words and visual regions. (Xiao, Xue, Shen, & Gao, 2022) proposed an attention-based LSTM for image captioning which attended to the more relevant features and paid more attention to the most relevant context words. Another work (Yan, et al., 2021) defined the task-adaptive attention concept for image captioning. It generated non-visual words by introducing diversity regularization and enhanced the expression ability of the proposed module. An improvement in performance on the MSCOCO dataset was observed by plugging the task-adaptive module into a vanilla transformer-based image captioning model. Many image captioning models focus on single-feature extraction and lack detailed descriptions of the image content. To address this limitation, (Lu, et al., 2021) presented a fuzzy attention-based DenseNet-BiLSTM model for Chinese image captioning. This model extracted image features at different scales and enhanced the ability of the model to describe the contents of the images.

Recently, the S2-Transformer for image captioning (Zeng et al., 2022) implicitly learnt pseudo regions through a series of learnable clusters. The experimental results on the MSCOCO dataset confirmed the effectiveness and interpretability of the proposed method. Further, a Transformer-based architecture (Liu, et al., 2022) seamlessly implemented causal intervention into both object detection and caption generation. Existing transformer-based methods with self-attention do not capture well the semantic association between objects with weak appearance features. To overcome this issue, the work (Jia, Wang, Peng, & Chen, 2022) presented the Semantic Association Enhancement Transformer (SAET) to capture the semantic association between appearance features of the candidate object and the query object.

3.1.2.2.5. Encoder-Decoder Architecture with Semantic-Concept. Semantic-concept-based image captioning is the ability to provide a detailed and coherent description of semantically important objects. The basic structure of semantic-concept-based captioning of images is presented in Fig. 15. Generally, a CNN-based encoder is used to extract features and semantic concepts. The extracted features are fed into a language generation model, and the semantic concepts are fed to different hidden states of the language model, which further produces a description of the image with semantic concepts.

(Karpathy & Fei-Fei, Deep Visual-Semantic Alignments for Generating Image Descriptions, 2015) combined a CNN and a bi-directional RNN to generate captions. Automatic description of an image with natural language is a challenging task in the fields of computer vision and natural language processing. Attributes of an image are considered rich semantic cues. (Yao et al., 2017) proposed LSTM with attributes (LSTM-A), which integrated attributes into the CNN-plus-RNN image captioning framework by training them in an end-to-end manner. The architecture was tested on the MSCOCO dataset and obtained METEOR and CIDEr-D scores of 25.2% and 98.6%, respectively. (You, Jin, Wang, Fang, & Luo, 2016) provided detailed and coherent object descriptions by using top-down and bottom-up approaches for generating captions. (Xu et al., 2015b) further worked on fixed and predefined spatial locations. This method can work at any resolution and on any location of an image with the use of a feedback process that helps to generate better captions for images. The earlier discussed methods (Yao et al., 2017) (You, Jin, Wang, Fang, & Luo, 2016) do not include high-level semantic concepts. Wu et al. (2018, March) discussed a high-level semantic-based captioning model. It extracted attributes using a CNN-based classifier, and the extracted attributes were used as high-level semantic objects to generate semantically rich captions. An analysis carried out on MSCOCO and the large-scale Stock3M dataset provided consistent improvements, especially on the SPICE metric. (Gao et al., 2018) presented a novel scene-graph-based semantic representation of an image. It built a vocabulary of semantic concepts, and a CNN-RNN-SVM framework was used to generate the scene-graph-based sequence. Another work (Tripathi, Nguyen, Guha, Du, & Nguyen, 2021) defined SG2Caps, which utilizes scene-graph labels for competitive image captioning performance, with the basic idea of reducing the semantic gap between the graphs obtained from an input image and its caption. The framework proposed by (Tripathi, Nguyen, Guha, Du, & Nguyen, 2021) outperformed (Gao et al., 2018) by a large margin and indicates scene graphs as promising representations for captioning images. High-level semantic information provides abstractedness and generality of an image, which is beneficial to improve performance. Liu and Xu (2020, Dec) generated logical and rich descriptions of images by fusing the image features and high-level semantics, followed by a language generation model. (Shi et al., 2020b) presented a novel architecture for caption generation to better explore the semantics available in captions.

Fig. 15. Basic Block representation of Semantic-Concept-based Image Captioning.
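The following sketch illustrates, in the spirit of Fig. 15, how a detected-attribute (semantic-concept) probability vector can be injected into an LSTM decoder alongside CNN features. It is a minimal illustration with assumed tensor shapes and module names, not the architecture of any cited paper.

```python
import torch
import torch.nn as nn

class ConceptConditionedCaptioner(nn.Module):
    """Sketch: fuse CNN features with a semantic-concept (attribute) vector."""
    def __init__(self, feat_dim=2048, n_concepts=1000, hidden_dim=512,
                 vocab_size=10000, embed_dim=300):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, hidden_dim)
        self.concept_proj = nn.Linear(n_concepts, hidden_dim)   # attribute probabilities
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, concept_probs, captions):
        # img_feat: (B, feat_dim), concept_probs: (B, n_concepts), captions: (B, T)
        h0 = torch.tanh(self.img_proj(img_feat) + self.concept_proj(concept_probs))
        c0 = torch.zeros_like(h0)
        # the fused image/concept state initializes the language model
        outputs, _ = self.lstm(self.embed(captions), (h0.unsqueeze(0), c0.unsqueeze(0)))
        return self.out(outputs)   # (B, T, vocab_size) next-word logits
```

In practice the concept vector would come from a multi-label attribute classifier trained on caption vocabularies (as in LSTM-A-style models); here it is simply treated as an input tensor.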


Fig. 16. An illustrative example of compositional architecture-based image captioning.
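As a rough illustration of the compositional pipeline described in Section 3.1.2.2.6 (the four stages follow the text; the component functions below are hypothetical placeholders and not the detectors, language models, or similarity models used in the cited papers), the sketch below wires together concept detection, candidate caption generation, and multimodal re-ranking.

```python
from typing import Callable, List

def compositional_caption(image_feat,
                          detect_concepts: Callable[[object], List[str]],
                          generate_candidates: Callable[[List[str]], List[str]],
                          similarity: Callable[[object, str], float],
                          top_k: int = 1) -> List[str]:
    """Compositional captioning pipeline: detect -> generate -> re-rank.

    detect_concepts:     maps image features to visual words/attributes (stage ii)
    generate_candidates: language model proposing captions from concepts (stage iii)
    similarity:          multimodal image-sentence similarity used for re-ranking (stage iv)
    """
    concepts = detect_concepts(image_feat)                 # e.g. ["dog", "grass", "running"]
    candidates = generate_candidates(concepts)             # multiple candidate sentences
    ranked = sorted(candidates, key=lambda s: similarity(image_feat, s), reverse=True)
    return ranked[:top_k]

# Toy usage with stand-in components (real systems plug in a CNN detector,
# a caption language model, and a deep multimodal similarity model).
if __name__ == "__main__":
    feats = None  # placeholder for CNN features (stage i)
    best = compositional_caption(
        feats,
        detect_concepts=lambda f: ["dog", "grass", "running"],
        generate_candidates=lambda c: [f"a {c[0]} {c[2]} on the {c[1]}",
                                       f"a photo of a {c[0]}"],
        similarity=lambda f, s: len(s.split()),  # stand-in score: prefer longer captions
    )
    print(best)
```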

The model of (Shi et al., 2020b) constructed caption-guided visual relationship graphs and further incorporated visual relationships to predict the word and object/predicate tag sequences. The Element Embedding LSTM (EE-LSTM) (Zhang, et al., 2020) generated rich descriptions of semantic features.

3.1.2.2.6. Compositional-Architecture-Based Image Captioning Methods. Compositional architecture is an alternative means to build an image captioning model that connects independent and loosely coupled components through a pipeline. It involves the following steps: (i) visual feature extraction using a CNN; (ii) visual concept (i.e., attribute) encoding from the visual features; (iii) generation of multiple candidate captions from the visual concepts; and (iv) re-ranking of the generated captions using a deep multimodal similarity model to select high-quality image captions. Fig. 16 illustrates an example of a compositional architecture-based method in which each component passes data to the next until a final result is obtained. (Fang, et al., 2016) used visual detectors, a language model, and a multimodal similarity model to generate captions including different parts of speech like nouns, verbs, and adjectives. The model provides better results on the MSCOCO dataset and produced a BLEU-4 score of 29.1%. (Tran, et al., 2016) attempted to generate captions for open-domain images (Instagram images). This model generated captions for landmarks and celebrities by detecting a diverse set of visual concepts. Describing images with adaptive adjunct words generates more informative sentences. In this direction, (Ma & Han, 2016) proposed a compositional network-based method to generate captions in the form of structured words < object, attribute, activity, scene >. It used multi-task and multi-layer optimization methods to generate semantically meaningful sentences with structural words. (Wang, Song, Yang, & Luo, 2016) combined the advantages of RNN and LSTM and defined a parallel-fusion RNN-LSTM architecture which performed better than other state-of-the-art methods (Fang, et al., 2016) (Tran, et al., 2016). (Tan et al., 2019) defined the Text2Scene concept, which generated descriptions for a compositional scene in the form of natural language. This model is not based on a GAN and provides better and superior results when compared with GAN-based methods for the generation of captions for images. It can handle different types of images, i.e., cartoon-like scenes, object layouts corresponding to real images, and synthetic images, for caption generation. Compositional-architecture-based models (Nikolaus et al., 2019) (Tian & Oh, 2020) are studied to measure how well a model composes unseen combinations of concepts while describing images. (Nikolaus et al., 2019) combined caption generation and image-sentence ranking in this direction. The model (Nikolaus et al., 2019) used a decoding mechanism that re-ranked the captions according to their similarity to the image, and it performed better when compared with other state-of-the-art methods. Image captioning based on the compositional-architecture model has gained popularity because of its fluency, which is an important factor for evaluation. (Tian & Oh, 2020) presented a hierarchical framework that generated accurate and detailed captions of images by exploring both the compositionality and sequentiality of natural language to produce detail-rich sentences with specific descriptions of objects such as color, count, etc. Captioning of images often focuses only on generalizing to images from the same distribution as the training set and not on generalizing to different distributions of images. (Bugliarello & Elliott, 2021) investigated different methods to improve compositional generalization for the syntactic structure of a caption and provided performance improvements.

3.1.3. Discussion

The discussion in the above sub-sections 3.1.1 and 3.1.2 portrays that each image captioning approach has its particular strengths and weaknesses. Traditional techniques (Ordonez et al., 2011) (Yang, Teo, Daume, & Aloimono, 2011) have limited capabilities, as the generated captions combine image features and language models instead of matching with captions that already exist. These methods fail to describe words outside the training data. To overcome this limitation, deep-learning-based techniques have been proposed for the generation of automatic image captions in recent years. (Kiros et al., 2014a) defined a multimodal-learning architecture using a combination of CNN and LSTM for text generation with superior BLEU scores. (Karpathy et al., 2014) introduced a bidirectional model for the image-sentence retrieval task. The work (Patwari & Naik, 2021) used an encoder-decoder structure to capture relationships between detected objects and scenes for the generation of enhanced descriptions of RGB images. Whereas, (Qiu, et al., 2021), for the first time, attempted to generate captions for egocentric images and to judge the type of food the subject is eating, the food portion size, and whether the subject is sharing food from the same plate or bowl with other individuals or not. It was indeed a significant achievement in detailing the minute details of the scene.


To describe unseen combinations of concepts in complex images, compositional-architecture-based models (Nikolaus et al., 2019) (Tan et al., 2019), as reported in the literature, have produced revolutionary results. They have greatly helped in reducing the semantic gap between real-world image captioning and synthetically generated captions.

Attention-based models (Zhang, Wu, Wang, & Chen, 2021) Deng et al. (2020, July) have successfully surpassed the performance of compositional architecture-based models by highlighting salient parts of the image and encoding the interactions between objects and the scene. These models have reported superior BLEU and METEOR scores compared with encoder-decoder-based deep models. (Yan, et al., 2021) described non-visual words using the task-adaptive attention module. The works Liu and Xu (2020, Dec), (Shi et al., 2020b), (Agrawal, et al., 2019), and (Feng, et al., 2019) generated semantically rich captions by focusing on different parts of images. Understanding emotions in an image is an interesting application of image captioning named stylized captioning. A few researchers have also worked in this field of research and expressed various emotions like romance, shame, and humor with the desired linguistic style. Dense image captioning is emerging as a reliable solution to generate diverse and multiple captions (Johnson et al., 2016) (Yang, Tang, Yang, & Li, 2016), resulting in visually pleasing results. (Kim et al., 2020) described a POS-aware relational captioning model; however, it suffers from visual and geometrical ambiguity and illumination effects. Furthermore, a budding field under image captioning known as CIC, which identifies a change and describes it in an incisive manner, is discussed. (Park et al., 2019) defined the DUDA model based on an attention mechanism rather than pixel-level difference. The works (Qiu et al., 2020) Qiu et al. (2020, August) presented a framework that described the changes from multiple viewpoints (or 3-D vision) in the form of natural language.

3.2. Video Captioning Techniques

Video captioning, an automated summarization of the content of videos, provides a detailed account of the visual information possessed in videos. The exponential growth of video captioning in the near future will let us interact with robots in the same way we interact with humans (Vries, Shuster, Batra, Weston, & Kiela, 2018). The visually impaired will benefit from verbal descriptions of their surroundings (Jain et al., 2018). However, it is not as easy as it seems. As depicted in Fig. 17, it involves a lot of understanding of video content and its grammatical representation frame by frame. Video captioning is broadly categorized into two categories, as shown in Fig. 18: (i) Single Sentence Video Captioning (SSVC) and (ii) Dense Video Captioning (DVC). SSVC involves a description of the whole video with one sentence, which may not be sufficient to describe the whole video. CNN (Yao, et al., 2015) or RNN (Srivastava, Mansimov, & Salakhudinov, 2015) based models are generally used to describe a video in a single sentence. Whereas, DVC generates multiple, densely connected natural language sentences. In this process, the events need to be localized in time (Krishna et al., 2017) (Zhou et al., 2018a). Further, the descriptions generated in dense captioning are more detailed and may be in the form of a paragraph.

3.2.1. Single Sentence Video Captioning (SSVC) Techniques

An automated Single Sentence Video Captioning (SSVC) system generates the contents of the whole video in the form of a single sentence. SSVC is broadly divided into template-based, ML-based, and deep-learning-based methods. The following subsections provide a complete description of all three types.

3.2.1.1. Template-Based Methods. The research in the field of video captioning started with the traditional approach, also known as the template-based video captioning technique. This technique first detects all the related actions, entities, and events for a video, followed by the creation of templates for the generation of suitable descriptions. These descriptions are grammatically correct, but the technique lacks the ability to generate variable-length sentence descriptions. It detects the subject, verb, and object (SVO) in a video and places them into pre-defined templates. Fig. 19 (a) presents examples of some popular templates used for the generation of sentences. In SVO, the verb is obtained by action or activity detection methods that use spatio-temporal features, whereas the subject and object are obtained from object detection methods with the help of spatial features. Most template-based approaches are composed of (i) visual attribute identification and (ii) natural language description generation. The sentences are generated with syntactical structures, and their quality highly depends on the templates of the sentences. (Kojima, Tamura, & Fukunag, 2002) described human activities from videos using the concept of hierarchies of actions. It generated action descriptions using semantic rules and a word dictionary. The model (Hakeem et al., 2004) addressed the issue of multiple dependent and independent events in a video using a case list of sub-events. (Tena, Baiget, Roca, & Gonzàlez, 2007) generated descriptions of humans in three particular languages, namely Catalan, English, and Spanish. This framework is divided into three parts: (i) detection and tracking of objects in a video; (ii) conversion of the structural knowledge obtained in (i) into logical knowledge; and (iii) description of human actions from the logical knowledge.

Fig. 17. Basic Structure of Video Captioning Problems.


Fig. 18. Taxonomy of Video Captioning methods.

Fig. 19. (a) Popular templates used in the generation of sentences. (b) Some popular predicates used in the generation of sentences.
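A toy illustration of the template-based (SVO) approach around Fig. 19 is sketched below; the detector outputs and template strings are invented for illustration and do not come from any of the cited systems.

```python
# Template-based captioning: fill detected subject/verb/object into fixed patterns.
TEMPLATES = [
    "{subject} is {verb} {object}.",
    "A video of {subject} {verb} {object}.",
]

def template_caption(detections: dict, template: str = TEMPLATES[0]) -> str:
    """detections holds SVO terms, e.g. from object and action detectors."""
    return template.format(subject=detections.get("subject", "someone"),
                           verb=detections.get("verb", "doing something with"),
                           object=detections.get("object", "something"))

# Example: subject/object from a spatial object detector, verb from an
# action-recognition model over spatio-temporal features (both assumed).
print(template_caption({"subject": "a man", "verb": "riding", "object": "a horse"}))
# -> "a man is riding a horse."
```

The rigidity noted in the text comes precisely from this fixed slot structure: sentence length and syntax cannot adapt beyond the predefined templates.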

(Lee, Hakeem, Haering, & Zhu, 2008) presented retrieval of video contents based on automatic Semantic Annotation of Visual Events (SAVE), which contains three stages, namely image parsing, event inference, and text generation. SAVE provided more information regarding the visual entities present in the scene when compared with others (Hakeem et al., 2004) (Tena, Baiget, Roca, & Gonzàlez, 2007). (Khan, Zhang, & Gotoh, 2011) predicted the gender of humans in a video by face recognition and the actions performed by a human by employing a template-based approach. Fig. 19 (b) presents sample predicates that are used for the generation of sentences. (Babru, et al., 2012) generated captions for a video that involved the interaction of more than one person. It involved object detection, object tracking, and dynamic programming. The model (Babru, et al., 2012) is of less importance because of its poor training due to limited objects and entities.

3.2.1.2. Machine Learning (ML)-Based Methods. ML-based models for video captioning make visual tracking and recognition more robust. These methods (Williams, 1992) (Lopez, 2008) (P, k., 2009) provided better opportunities to deal with larger datasets than template-based models. They employ better strategies for image processing, activity detection, and object detection. The work (P, k., 2009) defined a Statistical Machine Translation (SMT) open-source toolkit supporting effective data formats for language and translation models. SMT has a single phrase table and deals only with the surface form of words. An overview of SMT for treating the translation of natural language is provided in (Lopez, 2008), and a mechanism is defined for object recognition which is analogous to machine translation. (Rohrbach, et al., 2013) presented a method for the generation of descriptions of videos for the case of factored translations. This model provided descriptions of each word in terms of variables as provided by POS tags or lemmas. (Kojima, Izumi, Tamura, & Fukunaga, 2000) estimated the three-dimensional pose and position of the head and evaluated conceptual features using colored histograms. (Tan et al., 2011) worked on compact textual descriptions for complex video content using a rule-based approach for classification.


(Das, Xu, Doell, & Corso, 2013) represented the relevant contents of a video with a combination of top-down and bottom-up approaches. Another work (Das, Srihari, & Corso, 2013) presented a framework for modeling videos that is flexible enough to handle different types of document features, such as discrete and real-valued features from videos representing actions, objects, colors, and scenes, as well as discrete features from the text. This model proved to be the best fit for multimedia data. (Guadarrama, et al., 2013) presented YouTube2Text to describe and recognize arbitrary activities. It was evaluated on a large YouTube corpus and generated short descriptions of video clips better than the baseline approaches. The model proposed by (Guadarrama, et al., 2013) is based on (Deng, Krause, Alexander, & Li, 2012), which provided an optimal solution to the problem with high accuracy. (Thomason et al., 2014) described the scenes present in videos with more accurate and richer sentimental details. The method (Senina, et al., 2014) contributed to generating more coherent video descriptions with variable levels of detail by imposing across-sentence consistency at each level of semantic representation. (Xu et al., 2015c) presented a unified framework that accomplished modeling of deep videos and compositional text sentences to bridge vision and language jointly in three steps: (i) natural language generation, (ii) video retrieval, and (iii) language retrieval. (Yu and Siskind, 2015) described a model with weak supervision that exploited negative sentential information. This model trained words by training positive sentential labels against negative ones in a weakly supervised manner. (Sun and Nevatia, 2014) generated text for videos in the form of SVO. It is based on a Semantic Aware Transcription (SAT) framework with a random forest classifier that determines the relationship between current visual detectors and words in semantic space. Another work (Wang et al., 2013) enhanced the discriminative ability of motion atoms in complex videos by incorporating temporal constraints on a larger scale.

3.2.1.3. Deep-Learning-Based Methods. The methods discussed in sections 3.2.1.1 and 3.2.1.2 showed a pragmatic way to generate captions for videos, but the sentences generated are very rigid and lack vocabulary. Therefore, SVO-based methods are inadequate for open-domain datasets. Deep-learning-based video captioning techniques deal with algorithms based on neural networks which recognize actions (Donahue, et al., 2015) (Z, W., T, Y., Y, F., G, J. Y., 2016) (Feichtenhofer et al., 2017) and retrieve information, and they have justified the effectiveness of neural networks. These algorithms are beneficial for learning representations without requiring hand-crafted features extracted from the input data. The basic structural representation of deep-learning-based video captioning is depicted in Fig. 20, which consists of two stages, namely (i) the encoding stage (visual content extraction) and (ii) the decoding stage (language generation stage). The popular techniques used to extract the features of a particular video include CNN (Donahue, et al., 2015) Bin et al. (2018, May), RNN (Pan, Xu, Yang, Wu, & Zhuang, 2016), or LSTM (Pan, Mei, Yao, Li, & Rui, 2016) (Zhang et al., 2020). CNN-based architectures provide state-of-the-art modeling for the representation of visual data, whereas LSTM and RNN are setting new benchmarks in the field of Machine Translation (MT). The techniques used in the first stage can also be used for the generation of language, i.e., the decoding stage, with different structures of RNN like bi-RNN, LSTM, or GRU. AlexNet (Krizhevsky et al., 2017, June), VGG-16 (Simonyan & Zisserman, 2014), SPP (He et al., 2014), GoogLeNet, etc. are various pre-trained networks that can be used for this task. A comprehensive survey of various deep-learning methods for video captioning is presented by (Islam, et al., 2021). In the following subsections, all the earlier works as depicted in Fig. 20 are reviewed in detail. Also, an overview of all methods for deep-learning-based video captioning is presented in Tables 10-14, which highlight their pros and cons, societal impact as the number of citations, architectures used, datasets experimented on, and GitHub link.

3.2.1.3.1. Sequence-to-Sequence Models with and without Attention Mechanism. Encoder-decoder architecture is the basis for sequence-to-sequence-based video captioning models, which mainly consist of two parts: (i) extraction of features from the video and (ii) generation of natural language sentences. LSTM (Hochreiter & Schmidhuber, 1997) has been popularly used as the sequential model for language generation. Video captioning problems involve variable-length inputs and outputs, which can be sidestepped by considering same-size input videos; but in a real-world scenario, it is important to handle variable-length inputs. This issue has been addressed by many researchers in different ways, such as the generation of fixed-length video representations (Rohrbach, et al., 2013) (Guadarrama, et al., 2013), pooling over frames (Venugopalan, et al., 2014), selection of a fixed number of input frames from the input video, and direct handling of variable-length input. Fig. 21 (a) depicts the model architecture for training video frames in a sequence-to-sequence manner. It consists of a double-layered LSTM structure that describes events in a video, whereas Fig. 21 (b) illustrates a similar structure except for an additional attention layer that is essential to boost the performance of the generated descriptions. (Venugopalan, et al., 2014) presented a sequence-to-sequence model based on LSTM that translated videos directly into sentences. This method used a unified deep neural network that consisted of both convolutional and recurrent structures and generated captions of open-domain videos with a large vocabulary. Deep neural network models are dynamic, with enhanced performance. So, a general end-to-end approach is discussed in (Sutskever, Vinyals, & Quoc V. Le, 2014), which defined a multi-layered LSTM to map an input sequence to a vector, followed by another deep LSTM to decode the target sequence from the vector.

Earlier video captioning models generated sentences that are conceptually correct but not semantically correct. (Pan, Mei, Yao, Li, & Rui, 2016) addressed this issue by defining a novel unified framework known as LSTM-E that explored the joint learning of LSTM and visual-semantic embedding. The LSTM-E model has three components: (i) a deep CNN for the representation of videos, (ii) a deep RNN for the generation of sentences, and (iii) a joint embedding model to explore the relationship between visual content and sentence semantics. This technique, when compared with (Sutskever, Vinyals, & Quoc V. Le, 2014) (Venugopalan, et al., 2014), provided encouraging results in determining the SVO triplets (Babru, et al., 2012). (Xu et al., 2015c) presented a model similar to (Pan, Mei, Yao, Li, & Rui, 2016) that jointly modeled videos and corresponding text sentences and accomplished natural language generation with SVO prediction. (Venugopalan, et al., 2015) designed a sequence-to-sequence Video to Text (S2VT) method. It allowed learning of the temporal structure of video frames as well as the sequence model for the generated sentences. S2VT directly maps a sequence of frames to a sequence of words. (Xu et al., 2015a) presented a combination of CNN and RNN that simultaneously learns and extracts useful high-level concepts for the generation of natural language. Medium- and small-scale concepts from a video frame can be extracted by a Multi-scale Multi-instance Video Description Network (MM-VDN) that integrates a Fully Convolutional Network (FCN) for the same. The model (Xu et al., 2015a) is efficient and extensible and is especially suited for the processing of videos where the region proposal generation mechanism is slow. (Venugopalan et al., 2016a) investigated multiple techniques to incorporate linguistic knowledge from text corpora to aid video captioning and showed significant improvements in human evaluations of grammar. It achieved a substantial improvement in the descriptive quality of sentences by conceptualizing a deep fusion of the language model and video features. (Jin & Liang, 2016) used both visual and auditory information by utilizing an LSTM-RNN to model sequence dynamics. It is directly connected to a CNN and an acoustic feature extraction module that processed incoming video frames for visual and acoustic encoding.
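A minimal encoder-decoder sketch in the spirit of the sequence-to-sequence models above is given below (S2VT-like in structure, but not the authors' code; layer sizes, names, and single-layer LSTMs are assumptions): one LSTM encodes per-frame CNN features, and a second LSTM decodes words conditioned on the final video state.

```python
import torch
import torch.nn as nn

class Seq2SeqVideoCaptioner(nn.Module):
    """Sketch of an encoder-decoder (seq2seq) video captioner."""
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)   # over frames
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # over words
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim) CNN features of sampled frames
        # captions:    (B, T_words) token ids shifted right (teacher forcing)
        _, (h, c) = self.encoder(frame_feats)          # video summarized in (h, c)
        dec_out, _ = self.decoder(self.embed(captions), (h, c))
        return self.out(dec_out)                       # (B, T_words, vocab_size)

# Training minimizes cross-entropy between these logits and the ground-truth
# caption; at test time decoding is done greedily or with beam search.
```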


Table 10
Sequence to Sequence Video Captioning with or without Attention Mechanism.
Ref.  Citations  Method  Dataset  Encoding Stage  Text Generation  Advantages  Limitations  GitHub Link

(Bin, et al., 2018) 140 biLSTM + MSVD VGGNet LSTM Superior performance ROGUE-L and https://github.com/
Soft MSR-VTT over other state-of- CIDEr are not SeoSangwoo/Attention-
attention arts methods and considered metrics Based-BiLSTM-relation-
captures more extraction
semantics
(Long, Gan, & 63 LSTM + MSR-VTT LSTM Outperforms previous Difficulty in –
Melo, 2016) Multi- MSVD work by a large exploring multi-
faceted margin with or modal features like
Attention without semantic visual and audio
attributes. Efficient which includes
and robust tags, titles, and
comments
(Pan, Yao, Li, & 277 LSTM-TSA MSVD VGGNet LSTM Improvement in Complex analysis
Mei, 2016) M− VAD C3D performance as
MPII-MD compared to other
state-of-art methods,
large performance
gains which lead to
better performance.
(Yang, et al., 2018) 113 LSTM-GAN MSVD VGGNet LSTM-GAN Improvement in Performance is https://github.com/
MSR-VTT quality and diversity further improved yiskw713/
M− VAD of captions due to with VideoCaptioning
MPII-MD better control of the Reinforcement
capture generation Learning
process.
(Gao, Guo, Zhang, 402 LSTM + MSVD InceptionV3 LSTM Minimize the Training datasets https://github.com/
Xu, & Shen, Attention MSR-VTT relevance loss and are incomplete and zhaoluffy/aLSTMs
2017) semantic cross-view require refinement
loss and expansion
(Yao, et al., 2015) 1003 CNN-RNN MSVD Can capture local Dataset MPII https://github.com/yaoli/
+ Temporal DVS fine-graded motion similar to DSV can arctic-capgen-vid
Attention information from improve the results
consecutive frames significantly.
(Chen J., et al., 52 TDConvED MSVD VGGNet, RNN Improvement in Performance issues https://github.com/
Temporal MSR-VTT C3D, results when are there which b05902062/TDConvED
Deformable ResNet compared with RNN- can be resolved
Convolutional based techniques and with temporal
Encoder-Decoder CNN + RNN-based attention.
Networks for models
Video
Captioning,
2019)
(Zhao, Li, & Lu, 13 biLSTM + MSVD VGGNet, LSTM The object-aware Visual information –
2018) Attention Chardes ResNet model can capture the of small objects is
trajectories of objects hard to detect for
in videos. both frame features
and region features
for Chardes
(Cho, Courville, & 374 CNN + MSVD Superior performance Complexity –
Bengio, 2015) Attention and attention increases in
Mechanism mechanisms can comparison to
extract underlying seq2seq models.
mapping between
two different
modalities
(Nian, et al., 2017) 19 VRM MSVD VGGNet, LSTM High-level – –
(LSTM) M− VAD ResNet, representation of
MPII-MD video attributes leads
to an improvement in
performance
(Tu, Zhang, Liu, & 24 STAT MSVD, MSR- C3D, LSTM STAT method can pay – https://github.com/
Yan, 2017) VTT GoogleNet attention to multiple tuyunbin/Video-
prominent objects, Description-with-Spatial-
thus generating Temporal-Attention
detailed and accurate
descriptions.
(Jin & Liang, 2016) 30 LSTM-RNN MSVD VGGNet LSTM The combination of METEOR score is –
& CNN acoustic and visual only used as an
information evaluation metric
representation of
videos improves the
description
performance to a
great extent.
(Venugopalan, 1239 LSTM MSVD AlexNet, LSTM The model shows METEOR score is https://github.com/
et al., 2015) MVAD VGG significant only used as an vsubhashini/caption-eval
MPII-MD improvement in all evaluation metric
three datasets by
improving the overall
descriptive quality of
sentences
(Xu, Venugopalan, 60 CNN-FCN + MSVD AlexNet, LSTM The model is efficient Several fine-tuning https://github.com/
Ramanis, LSTM VGG, and extensible Its mechanisms are takeuchi-lab/MS-DA-MIL-
Rohrbach, & GoogLeNet efficiency makes it tried with an CNN
Saenko, 2015) especially suitable to integrated model.
process video, where
region proposal
generation
mechanisms would be
prohibitively slow
(Venugopalan, 128 S2VT M− VAD CNN + LSTM Variable-length input METEOR score is https://gist.github.com/
Anne, Mooney, & (LSTM) MPII-MD LSTM and output can be only used as an vsubhashini/
Saenko, 2016) MSVD handled, has a high evaluation metric 38d087e140854fee4b14
model capacity, and
can learn complex
temporal structures
(Xu, Xiong, Chen, 253 Deep Video YouTube Better results are Need for a larger –
& Corso, 2015) Model and Video obtained in dataset to improve
Joint comparison to SVM, the results
Embedding CRF, and CCA. The
mean rank of video
and text retrieval is
high
(Pan, Mei, Yao, Li, 514 LSTM-E MSVD, MPII- VGG, C3D LSTM-RNN LSTM-E provides RNN technique –
& Rui, 2016) MD, M− VAD better results for SVO provides a better
prediction and representation of
sentence generation. videos
(Ji, Wang, Tian, & 2 ADL MSVD Inception- LSTM The model decreases – –
Wang, 2022) MSR-VTT V4 the semantic gap
between raw videos
and generated
sentences.
(Zhang, et al., ORG-TRL MSVD 2D-3D CNN LSTM The proposed model The baseline model –
2020) MSR-VTT recognized more can only
VATEX detailed objects. understand the
general meaning of
the video
(Chen J., et al., – R-ConvED MSVD Conv- Conv- Provided superior Complex Model –
Retrieval MSR-VTT Encoder Decoder results on all with the
Augmented ActivityNet benchmark datasets. introduction of
Convolutional VATEX RAM in the model.
Encoder-Decoder
Networks for
Video
Captioning,
2022)
(Seo, Nagrani, MV-GPT YouCook2 ViViT BERT This model provides This approach is –
Arnab, & MSR-VTT state-of-the-art not always a
Schmid, 2022) ActivityNet results on all datasets successful option
HowTo100M for sentence
generation

Experimental results on MSVD proved that this fusion of audio information improves the video description to a great extent. The methods mentioned above for the seq2seq approach achieved excellent results, but they extracted features from short video frames that represented local features only. Further, (Nian, et al., 2017) proposed a method known as the Video Response Map (VRM) that captured both local and global video features and was further used to learn high-level video attribute features. This model was trained on video-sentence pairs, and it utilizes caption transfer based on similar videos or video segments. Variable-length input problems cannot be solved with the use of the sequence-to-sequence methods discussed above. Therefore, attention-based encoder-decoder models Cho et al. (2015, July) (Zhao et al., 2018) are presented that give better results for variable-length videos when compared with simple encoder-decoder architectures. Also, attention mechanism-based models were able to extract complex features. Video captioning models focused mainly on global frame features while paying less attention to salient objects. (Zhao et al., 2018) addressed this issue with the use of a tube feature model. A bi-directional LSTM is used as an encoder, while a single LSTM extended with an attention model is used as a decoder. (Chen et al., 2019b) introduced the Temporal Deformable Convolutional Encoder-Decoder Network (TDConvED) for addressing the limitations of LSTM-based models for long-term dependencies. This model employed convolutions in both the encoder and decoder networks and also capitalized on a temporal attention mechanism that is used for the generation of sentences.
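To illustrate the convolutional alternative to recurrent encoders mentioned for TDConvED, the sketch below uses a standard 1-D temporal convolution rather than the paper's deformable variant; the shapes and layer choices are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TemporalConvEncoder(nn.Module):
    """Sketch: encode a sequence of frame features with stacked 1-D temporal convolutions."""
    def __init__(self, feat_dim=2048, hidden_dim=512, kernel_size=3, num_layers=2):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(num_layers):
            layers += [nn.Conv1d(in_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
                       nn.ReLU()]
            in_dim = hidden_dim
        self.conv = nn.Sequential(*layers)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) -> Conv1d expects (B, feat_dim, T)
        x = self.conv(frame_feats.transpose(1, 2))
        return x.transpose(1, 2)   # (B, T, hidden_dim) contextualized frame features

# The decoder (recurrent or convolutional) can then attend over these features
# at every word-generation step, as in the temporal-attention models above.
feats = torch.randn(4, 30, 2048)           # 4 clips, 30 sampled frames each
print(TemporalConvEncoder()(feats).shape)  # torch.Size([4, 30, 512])
```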


Table 11
Transformer-Based Models for Video Captioning.
Ref.  Citations  Method  Dataset  Encoding Stage  Text Generation  Advantages  Limitations  GitHub Code

(Sun, 408 CBT (Extended Breakfast, S3D Transformer more useful than Works well with https://github.com/
Myers, BERT) 50Salads, existing self-supervised smaller datasets only. Tianwei-She/
Vondrick, ActivityNet methods for a variety of representation-learning-
Murphy, downstream video tasks, papers
& such as classification,
Schmid, captioning, and
2019) segmentation
(Sun, 83 VideoBERT YouCook II able to learn high-level Limited to dataset https://github.com/
Baradel, semantic only for cooking, rich yuewang-cuhk/
Murphy, representations, and we video frames are not awesome-vision-
& outperform the state-of- fully utilized language-pretraining-
Schmid, the-art video captioning papers
2019) on the YouCook II
dataset. Can be used for
open vocabulary
classification whose
performance grows
monotonically with the
size of the training set.
(Zhu & 80 ActBERT with Crosstask, Improves the overall Removal of regional –
Yang, TNT MSR-VTT, performance by 4%, information leads to
2020) YouCook2 better pre-training a performance drop
framework, TNT when compared with
enhances the a full model
communication
(Li, et al., 70 HERO TV + ResNet Transformer Captures temporal https://github.com/
2020) How2100M alignment both locally linjieli222/HERO
and globally on two
large-scale datasets.
(Luo, et al., 56 UniVL How2100M, ResNet + Transformer It is a flexible model for Joint loss decreases https://github.com/
2020) YoucookII, ResNeXt, most of the multimodal the generation task a microsoft/UniVL
MSR-VTT S3D downstream tasks little, although it
considering both performs well in the
efficiency and retrieval task.
effectiveness Excessive emphasis
on coarse-grained
matching can affect
the fine-grained
description at the
generation task.
(Jin, Huang, 9 Vanilla MSVD, ResNet, I3D Transformer SBAT generates more – https://github.com/
Chen, Li, Transformer + MSR-VTT accurate descriptions NonameAuPlatal/
& Zhang, SBAT that are close to GT and Dual_Learning
2020) improve attention logits
in a vanilla transformer.
Reduces feature
redundancy.
SBAT with boundary-
aware attention, local
correlation, and aligned
cross-modal interaction
achieves promising
results under all the
metrics.
(Sur, 2020) 5 SACT + ActivityNet ResNet Transformer This novel architecture METEOR, ROUGE L, –
Multimodal YoucookII with better content and CIDEr-D provide
Attention understanding helps a limited perspective
with an improved of the generated
regional proposal and captions and, hence,
later translates the a qualitative
filtered contents to evaluation of the
better captions for generated captions is
videos inevitable
(Fang, 16 V2C- V2C, MSR- ResNet CMS Rich commonsense The ATOMIC dataset https://github.com/
Gokhale, Transformer VTT, captions are generated albeit with noise and jacobswan1/
Banerjee, Architecture ATOMIC with factual annotations is not Video2Commonsense
Baral, & descriptions. Describes relevant to the video.
Yang, intentions, effects MSR-VTT samples
2020) attributes, and actions don’t have clear
that happen in future human activities
resulting imbalance
(Zhong, – BiTransformer MSVD Transformer Transformer –
Zhang, MSR-VTT
Wang, & Produces better captions Provides slightly


Xiong, when compared with worse performance
2022) vanilla transformer for BLEU-4
(Li, et al., 1 LSRT MSVD 3D-CNN Transformer – – –
2022) MSR-VTT
(Gao, et al., – D 2 -Transformer MSVD Transformer Transformer This model can Complex Analysis –
2022) MSR-VTT recognize more detailed
VATEX activities
(Lin, et al., – SWINBERT MSVD Transformer Transformer Provides superior results Model Performance https: //github.com/
2022) YouCookII on all datasets used can be further microsoft/SwinBERT
VATEX improved with pre-
training on even
larger-scale datasets
(Shi, et al., – VTAR VATEX C3D Transformer Provides improvements Baseline models –
2022) MSR-VTT in out-of-domain video generated incorrect
captioning captions for some
instances.

Extensive experiments with TDConvED were carried out on the MSVD and MSR-VTT datasets and provided superior results when compared with traditional RNN-based encoder-decoder techniques. Another work by (Yao, et al., 2015) presented a sequence-to-sequence model with temporal attention. It took into account both the local and global temporal structure of videos to produce descriptions, using a 3-D CNN-RNN encoder-decoder architecture for local spatio-temporal information. (Gao, Guo, Zhang, Xu, & Shen, 2017) resolved the issue of translation errors and ignorance of the correlation between sentence semantics and visual content by introducing aLSTM. This model is an end-to-end framework with an attention mechanism that captures salient structures of videos for the generation of sentences with rich semantic content. Furthermore, (Tu et al., 2017) presented the Spatial-Temporal Attention (STAT) framework. This method recognized salient objects more precisely, with high recall, and automatically focused on the most relevant spatial-temporal segments. Yang et al. (2018, November) worked on the deficiencies of LSTM-based methods by introducing LSTM-GAN. A joint LSTM with adversarial learning outperformed the existing methods and has the scope to further improve performance with reinforcement learning.

Recent works (Long, Gan, & Melo, 2016) (Pan et al., 2016) on encoder-decoder architectures included the injection of semantic attributes into sequence learning for video captioning. These attributes express semantic information that provides better visual recognition. (Pan et al., 2016) proposed LSTM with transferred semantic attributes (LSTM-TSA) to extract semantic attributes from video frames. The method (Pan et al., 2016) did not make full use of pertinent semantic cues. (Long, Gan, & Melo, 2016) presented a unified framework defining multifaceted attention layers (MFATT) with LSTM. This model outperformed earlier works and performed robustly even in the presence of added noise. Bin et al. (2018, May) discussed a technique that integrated bi-LSTM and soft attention. This model preserved global temporal and visual information and enabled a language decoder to focus on complex events. (Ji, Wang, Tian, & Wang, 2022) combined an encoder-decoder reconstructor and multi-head attention and proposed a novel dual-learning approach for video captioning. The proposed method minimized the semantic gap between raw videos and the generated captions and enhanced the quality of the generated video captions. Further, (Zhang, et al., 2020) proposed an object-relational graph-based encoder with a teacher-recommended learning method to integrate abundant linguistic knowledge into the captioning model. The proposed model achieved superior results on the MSVD, MSR-VTT, and VATEX datasets. Furthermore, (Chen et al., 2022) proposed Retrieval-Augmented Convolutional Encoder-Decoder Networks (R-ConvED), which novelly integrated RAM into a convolutional encoder-decoder structure, facilitating word prediction and boosting the performance of video captioning. (Seo, Nagrani, Arnab, & Schmid, 2022) presented Multimodal Video Generative Pretraining (MV-GPT), which is learned from unlabelled videos and is used for generative tasks. This model generates a caption from raw pixels and transcribes the generated caption into speech directly.

3.2.1.3.2. Transformer-Based Methods. Transformer-based methods have recently shown exemplary performance on a broad range of language tasks. The breakthroughs from transformer networks in the NLP domain have sparked great interest in CV. These methods use a soft-attention mechanism that learns the relationships between the elements of a sequence. Transformer-based architectures (Devlin, Chang, Lee, & Toutanova, 2019) (Sun, Myers, Vondrick, Murphy, & Schmid, 2019) are also capable of handling variable-sized inputs using stacks of attention layers instead of RNNs or CNNs; in other words, these methods are stacks of encoder or decoder layers. (Devlin, Chang, Lee, & Toutanova, 2019) introduced the simple but powerful Bidirectional Encoder Representations from Transformers (BERT). It was based on pre-training and fine-tuning and successfully tackled a broad set of NLP tasks. (Sun, Myers, Vondrick, Murphy, & Schmid, 2019) presented an extended version of BERT known as VideoBERT that can be used in numerous tasks like action classification and video captioning. It learned from quantized video frames and provided quantitative results for models that learned high-level semantic features. However, rich video frame features were not fully utilized in (Sun, Myers, Vondrick, Murphy, & Schmid, 2019), as video frames are represented by only discrete tokens. Therefore, a Contrastive Bi-directional Transformer (CBT) (Sun et al., 2019) was defined as a remedy for this. It used a contrastive loss for video representation learning alone and provided improved performance in many downstream tasks like video classification, captioning, and segmentation. In the year 2020, (Luo, et al., 2020) presented a Unified Video and Language pre-training model (UniVL) for both multimodal understanding and generation. It improved video-text-related downstream tasks like text-based video retrieval and multimodal video captioning. UniVL also provided state-of-the-art results by demonstrating that it can learn strong video-text representations. ActBERT (Zhu & Yang, 2020) is a self-supervised learning technique for joint video-text representation of unlabeled data. It uncovers global and local visual clues from paired video sequences for sentence generation. The success of models such as (Zhu & Yang, 2020) is limited by several constraints: (i) loss of temporal alignment between video and text due to simple concatenation of subtitle sentences, and (ii) the video datasets available for current models are restricted to cooking and narrated videos only. To overcome these challenges, (Li et al., 2020) presented a model namely Hierarchical Encoder for Video + Language (HERO). This framework is a video-and-language large-scale pre-training model that contains a cross-modal transformer and a temporal transformer for multimodal fusion. HERO (Li, et al., 2020) set a new state-of-the-art on multiple benchmarks for different domains when compared with flat BERT (Devlin, Chang, Lee, & Toutanova, 2019).
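As a minimal, generic illustration of the transformer-based pipeline discussed in this subsection, the sketch below uses a standard PyTorch transformer decoder with cross-attention over frame features. It is not the architecture of BERT, VideoBERT, UniVL, or any other cited model; dimensions, the learned positional embedding, and the maximum caption length are assumptions.

```python
import torch
import torch.nn as nn

class TransformerVideoCaptioner(nn.Module):
    """Sketch: cross-attention from caption tokens to encoded frame features."""
    def __init__(self, feat_dim=1024, d_model=512, vocab_size=10000,
                 nhead=8, num_layers=3, max_len=30):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, d_model)       # project frame features
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim); captions: (B, T_words) token ids
        memory = self.frame_proj(frame_feats)                 # visual "memory"
        t = captions.size(1)
        tgt = self.embed(captions) + self.pos[:, :t]
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=captions.device), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)   # masked self-attn + cross-attn
        return self.out(hidden)                               # (B, T_words, vocab_size)
```

Pre-training approaches such as VideoBERT or UniVL additionally place masked-token and video-text alignment objectives on top of this kind of backbone before fine-tuning for captioning.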


Table 12
Hierarchical-Based Methods for Video Captioning.
Ref.  Citations  Method  Dataset  Encoding Stage  Text Generation  Advantages  Limitations  GitHub Link

(Pan, Xu, Yang, Wu, 370 HRNE MSVD ConvNet LSTM-GRU HRNE reduces the length For the M− VAD dataset, –
& Zhuang, 2016) M− VAD of input information flow the BLEU metric is not
and exploits temporal considered in the
structure in a longer range experiment
at a higher level and is
more non-linear, flexible,
and generic. can be
applied to a wide range of
video applications.
(Yu, Wang, Huang, 547 h-RNN YouTube VGGNet GRU Outperforms the current There is a discrepancy –
Yang, & Xu, 2016) Clips, state-of-art methods with between the objective
TACoS- BLEU@4 scores and function used by training
MultiLevel generates a sequence of and the one used by
sentences for a given generation, the detection
video data. of small objects in
complex videos is still a
problem in this technique
(Liu A.-A., et al., 29 HVMC MSVD VGGNet RNN The HMVC model In terms of METEOR –
2017) MPII-MD discovers the internal HVMC is slightly worse on
knowledge by jointly MSR-VTT.
learning the dynamics
within both visual and
textual knowledge
conveyed in the fine-grain
video units
(Mehri & Sigal, 14 Dual-Self MSVD Inception LSTM The quality of the – –
2018) Attention proposed model depends
mechanism on the quality of the
middle word (verb)
classifier. The classifier
proposed is simple, but
could be substantially
improved using
specialized action
classification
architectures.
(Wu & Han, 2020) 1 LSTM + MSVD, GoogLe LSTM MemNet-based decoder – –
MemNet MSR-VTT Net, obtains competitive
based ResNet performance and
decoder outperforms RNN decoder
(Dave & – – ActivityNet C3D LSTM Two LSTMs produce – –
Padmavathi, coherent and contextually
2022) aware sentences.
(Zhang & Peng, 125 OA-BTG MSVD C3D GRU OA-BTG achieves state-of- The proposed model –
Object-aware MSR-VTT the-art performance in needs to not only model
Aggregation with terms of BLEU@4, salient objects with their
Bidirectional METEOR, and CIDEr trajectories, but also
Temporal Graph metrics understand interaction
for Video relationships among
Captioning, 2019) objects
(Zhang & Peng, 32 OSTG MSVD C3D GRU Provides state-of-the-art The proposed model fails https://github.
Video Captioning MSR-VTT performance in terms of to describe some temporal com/PKU-ICST-
With Object- BLEU@4, METEOR, and information and detailed MIPL/
Aware Spatio- CIDEr metrics content. OSTG_TIP2020
Temporal
Correlation and
Aggregation,
2020)

(Jin, Huang, Chen, Li, & Zhang, 2020) focused on the problem of applying the transformer structure to video captioning effectively and introduced the Sparse Boundary-Aware Transformer (SBAT). It reduced the redundancy in video representation and extracted diverse features from different scenarios with a boundary-aware pooling mechanism. This model is considered an improvement over transformer-based encoder-decoder architectures for the representation of videos. (Sur, 2020) worked on improving the capture of suitable content and proposed the Self-Aware Compositional Transformer (SACT) with feature understanding at the frame level. SACT (Sur, 2020) engaged more refined feature content and a composition of usable attention to derive a better representation. It also addressed the problems that arise due to diversification, improved the quality, and created a better attention composition. The Video-to-Commonsense (V2C) framework (Fang, Gokhale, Banerjee, Baral, & Yang, 2020) generated video descriptions with commonsense. This model included intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the whole action. V2C is the first generative model (Fig. 22) for commonsense video captioning with rich commonsense descriptions. It contains a cross-modal transformer model for the generation of commonsense-enriched descriptions of videos. The model also adopted a video encoder that extracted global representations from the input video and a transformer decoder that produced relevant commonsense knowledge along with the captions.


Table 13
Deep-Reinforcement Learning Based Methods for Video Captioning.
Ref.  Citations  Method  Dataset  Encoding Stage  Text Generation  Advantages  Limitations  GitHub Link

(Wang, 172 HRL MSR-VTT, ResNet HRL A large-scale dataset was Results can be further https://github.com/BUPT/
Chen, Wu, Charades introduced for fine-graded improved by C3D conversational-ai-club/
Wang, & video captioning and features, optical blob/master/docs/papers/
Wang, obtains state-of-art flows, etc. (Krishna, video-captioning-via-
2018) performance on MSR-VTT Hata, Ren, Fei-Fei, & Hierarchical-reinforcement-
Niebles, 2017) learning-2018.md

(Pasunuru 95 CIDEnt MSR-VTT, Inception LSTM-RNN The CIDEnt-reward model CIDEnt score is only https://github.com/
& Bansal, MSVD produces better-entailed high when both ramakanth-pasunuru/
2017) captions than the ones CIDEr and the video_captioning_rl
generated by the CIDEr entailment classifier
reward model, achieves achieve high scores.
new state-of-art to MSR-
VTT, and generates
ground truth style
captions
(Chen, 106 PickNet MSVD, ResNet LSTM, GRU The fast method achieves
Wang, MSR-VTT competitive results even
Zhang, & without utilizing attribute
Huang, information This
2018) architecture can be
applied to streaming
videos.
(Phan, 12 CST with MSR-VTT ResNet, LSTM A variant of REINFORCE The Conventional XE https://github.com/mynlp/
Henter, REINFORCE C3D, MFCC algorithm that establishes training model is as cst_captioning
Miyao, & algorithm. a new state of the art with fast as CST-RL
Satoh, fine-tune generated
2017) captions which removes
the exposure bias
(Wei, Mi, 4 LSTM model in MSVD, VGGNet, LSTM The method applies LSTM this model generates https://github.com/
Hu Zhen, a sliding MSR-VTT, ResNet, in a sliding window more detailed rayat137/Pose_3D
& Chen, window Charades C3D manner which exploits the sentences but gets a
2020) manner local temporal lower BLEU @4
information more score.
efficiently.

Further, (Fang, Gokhale, Banerjee, Baral, & Yang, 2020) attempted to use open-ended video-based commonsense question answering to generate rich captions.

The methods discussed above could not completely use the semantic concept, as they ignore the right-to-left style of context and are more focused on the left-to-right style. Therefore, to address this issue, the BiTransformer (Zhong, Zhang, Wang, & Xiong, 2022) was proposed, which also captured the left-to-right semantic context, resulting in better captions. To better explore the relationships among objects in video captioning and capture long-term and short-term dependencies, (Li, et al., 2022) proposed a long short-term relation transformer. This method alleviated the problem of over-smoothing and strengthened relational reasoning. Further, to generate visual-semantic and syntax-related words in equal proportions, a dual-level decoupled transformer (Gao, et al., 2022) was proposed. This model generated more reasonable and fine-grained captions for videos. Also, (Lin, et al., 2022) presented an end-to-end transformer, namely SWINBERT. The proposed transformer made use of sparse attention, took video frame patches as inputs, generated high-quality captions, and provided improvements in long-range video sequence modeling. Furthermore, (Shi, et al., 2022) proposed a video-text alignment module with a retrieval unit for video captioning. This model reduced the semantic gap between the visual and the textual data by producing correct and distinctive captions.

3.2.1.3.3. Hierarchical Methods. Hierarchical-based video captioning techniques use temporal hierarchical methods, spatial hierarchical methods, or a combination of these, which can detect the occurrence of objects in a video. (Pan, Xu, Yang, Wu, & Zhuang, 2016) presented a Hierarchical Recurrent Neural Encoder (HRNE) to model video temporal information using a hierarchical recurrent encoder. This model uncovered the temporal transitions between frames at different levels. (Liu et al., 2017) defined the Hierarchical and Multimodal Video Captioning (HMVC) model that integrated rich and primeval external knowledge (i.e., frame-based image captions), benefiting video captioning. This model worked well for both textual and visual modalities. Fig. 23 depicts the difference between the sequence-to-sequence model (Venugopalan et al., 2016a), which generates text descriptions directly from the visual source, and HMVC: the HMVC model integrates rich intermediate knowledge by leveraging the latent semantic knowledge in each frame to compose descriptions of the given video. Sequence-to-sequence models (Venugopalan et al., 2016a) are challenged by a lack of diversity and the inability to be externally controlled. (Mehri & Sigal, 2018) presented a middle-out decoding method that began from an initial middle word and simultaneously expanded the sequence in both directions, providing an improvement over left-to-right sequence-to-sequence learning. Though middle-out decoding improves the decoding efficiency, it is unable to tackle the long-term information dilution problem. To avoid loss of long-term information in the decoding stage, (Wu & Han, 2020) defined a MemNet-based, or hierarchical, decoder for video captioning. This model is advantageous for storing long-term information and outperforms the RNN decoder.

Hierarchical methods for video captioning find applications in the field of paragraph generation. (Yu, Wang, Huang, Yang, & Xu, 2016) have touched upon paragraph generation using a hierarchical RNN (h-RNN). This model exploited both temporal and spatial attention mechanisms to selectively focus on visual elements during generation and was successful in describing a realistic video either in a single sentence or a paragraph, with RNN and Gated Recurrent Unit (GRU) (Huang, Ma, & Zhang, 2009) used as the decoder, and achieved state-of-the-art results on YouTubeClips and TACoS-Multilevel. (Zhang and Peng, 2019b) captured detailed temporal dynamics for salient objects in videos. Furthermore, to capture the objects' relationships both within and across frames, the object-aware spatio-temporal graph (OSTG)


Table 14
Overview of Dense Video Captioning.
Ref Citations Method Dataset Image Text Advantages Limitations GitHub Link
Encoder Generation

(Senina, et al., 2014) 189 SMT TACoS – SMT This work outperformed the The system predicted, –
retrieval-based approach, “Preparing orange
due to more accurate juice” instead of
recognition of activities/ “Juicing a lime”,
objects. confusing the main
object of the video.
(Shin, Ohnishi, & 26 CNN-RNN Montreal, VGGNet, LSTM The method outperformed Results can further be –
Harada, 2016) MPII, MS iDT the single-frame method improved by face
while being comparable to identification.
the current state-of-the-art
method.
(Yu, Wang, Huang, 547 h-RNN YouTube VGGNet, GRU Outperformed the current Detection of small –
Yang, & Xu, 2016) Clips CD3 state-of-the-art with objects in complex
TACoS BLEU@4 scores. videos is still a
problem in this
technique
(Xiong, Dai, & Lin, 42 LSTM net ActivityNet LSTM LSTM Generates paragraphs with Efficiency can be
2018) the temporal structure of increased with other
given videos which covers datasets
major semantics without
redundancy.
(Escorcia, Heilbron, 332 DAP THUMOS- C3D LSTM Produces high-quality The quality of https://github.
Niebles, & Ghanem, 14 segments, the most effective segments can be com/
2016) and efficient network that increased for a large escorciav/
produces temporal segments variety of activity deep-action-
over a long video sequence lengths proposals

(Zhou, Zhou, Corso, 235 CNN + ActivityNet ResNet, Transformer Provides better results than Not suitable with a 1- https://github.
Socher, & Xiong, Masked YouCookII ProcNet RNN-based models for event layer transformer. com/
2018) Transformer proposal and captioning salesforce/
tasks. densecap

(Mun, Yang, Ren, Xu, 54 SDVC ActivityNet C3D LSTM This method achieved state- https://github.
& Han, 2019) of-art results in terms of com/
METEOR ttengwang/
ESGN

(Lie, et al., 2020) 38 MART ActivityNet Transformer This model generates more Repetition still exists https://github.
YouCookII coherent paragraphs within a single com/jayleicn/
without losing paragraph sentence. recurrent-
accuracy transformer

(Zhang, Xu, Ouyang, 12 DaS with ActivityNet ResNet, LSTM Effective Model for Dense Complex Model –
& Tan, 2020) LSTM C3D Video Captioning as the
model is divided into
division and summarization
framework.
(Suin & Rajagopalan, 5 CNN + LSTM ActivityNet ResNet LSTM The model works well with Performance –
2020) + Self the complex task of dense parameters that are
Attention video captioning to reduce used are BLEU and
the overall computational METEOR only
cost
(Iashin & Rahtu, 29 MDVC ActivityNet VGGish, Transformer Audio and speech – github.com/
Multi-modal Dense I3D, C3D modalities improve the viashin/MDVC
Video Captioning, performance of dense video
2020) captioning
(Iashin & Rahatu, A 23 – ActivityNet VGGish, Bi-Modal Provides improvements in – viashin.github.
Better Use of Audio- I3D, Transformer the results when compared io/bmt
Visual Cues: Dense GloVe to (Iashin & Rahtu, Multi-
Video Captioning modal Dense Video
with Bi-modal Captioning, 2020) with the
Transformer, 2020) use of a bi-modal
transformer
(Song, Chen, & Jin, – – ActivityNet, ResNet, – A key-frame-aware encoder – https://github.
2021) Charades I3D is proposed which improves com/syuqings/
the efficiency by generating video-
more coherent and diverse paragraph
video paragraphs.

(Wang, et al., 2021) – PDVC ActivityNet C3D, TSN Transformer – – https://github.


Youcook2 com/
ttengwang/



PDVC

(Bao, Zheng, & Mu, 1 DepNet ActivityNet C3D LSTM This model effectively For some sentences of –
2021) TACoS captures the temporal more than 5, the
context in the improvement is almost
accompanying paragraph of saturated
a video
(Yamazaki, et al., – VLCap YouCookII C3D Transformer This model provides the best Complex Model https://github.
2022) ActivityNet performance with large gaps com/UARK-
in accuracy metrics and AICV/VLCAP
diversity metrics

Fig. 20. Overview of Deep-Learning-Based Video Captioning steps.
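As a concrete reference for the pipeline summarized in Fig. 20, the following minimal Python sketch illustrates the generic encode-then-decode loop shared by the deep-learning-based captioners surveyed here. It is an illustrative simplification under stated assumptions, not the implementation of any particular method; mean_pool, greedy_decode, and the step callable are placeholder names supplied by the reader's own model.

```python
# A minimal sketch of the generic encode-then-decode captioning loop summarized in
# Fig. 20. All names (mean_pool, greedy_decode, step, ...) are placeholders, not
# APIs from any of the surveyed papers.
from typing import Callable, List, Sequence

def mean_pool(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame CNN features into a single video-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

def greedy_decode(video_vector: List[float],
                  step: Callable[[List[float], int], Sequence[float]],
                  bos: int, eos: int, max_len: int = 20) -> List[int]:
    """Generate a caption token by token; `step` stands in for one LSTM or
    Transformer decoder step and returns a distribution over the vocabulary
    given the video vector and the previous token."""
    tokens = [bos]
    for _ in range(max_len):
        probs = step(video_vector, tokens[-1])
        next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens
```

In practice the mean pooling is replaced by a recurrent or transformer encoder and the greedy argmax by beam search, but the control flow above is what all encoder-decoder captioners share.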

approach (Zhang and Peng, 2020) was presented for video captioning. (Zhang and Peng, 2019b) and (Zhang and Peng, 2020) also presented hierarchical attention and learned discriminative spatio-temporal features to achieve state-of-the-art performance in terms of the BLEU@4, METEOR, and CIDEr metrics. (Zhang and Peng, 2019a) presented an attention-guided hierarchical alignment approach for video captioning. This method explored multi-granularity visual features to capture coarse-to-fine visual information and obtain a comprehensive understanding of complex and dynamic video content. Furthermore, (Dave & Padmavathi, 2022) proposed a hierarchical video captioning model that used bi-directional alteration of a single-stream temporal action proposal network and produced coherent and contextually aware descriptions.
3.2.1.3.4. Deep-Reinforcement Learning-Based Methods. The sequence-to-sequence models discussed earlier have shown promising results in abstracting a coarse description of videos; nonetheless, they are still unable to caption a video that contains multiple fine-grained actions. In this sub-section, state-of-the-art methods based on reinforcement learning, which can generate video descriptions with multiple fine-grained actions, are discussed. (Wang, Chen, Wu, Wang, & Wang, 2018) presented a Hierarchical Reinforcement Learning (HRL) framework, Fig. 24, composed of a novel high-level Manager module and a low-level Worker module to design sub-goals and recognize primitive actions. This model achieved state-of-the-art results on the MSR-VTT and Charades datasets for fine-grained video captioning and has received the highest citation count among the methods listed in Table 13. (Krishna et al., 2017) performed multitask reinforcement learning by capturing the dependencies between the events in a video and introduced a new captioning module that used contextual information from past and future events to jointly describe all events (Fig. 25). Evaluation metrics used by deep-reinforcement-learning-based methods included BLEU and CIDEr. Both are traditional phrase-matching metrics that somehow fail to capture the logical correctness of the generated sentences. To generate logically correct descriptions of videos, (Pasunuru & Bansal, 2017) defined an entailment-enhanced reward (CIDEnt) that corrected phrase-matching-based metrics such as CIDEr; it allowed logically implied partial matches and avoided contradictions. PickNet (Chen et al., 2018e) performed informative frame picking for video captioning. This model is based on an encoder-decoder architecture that trained the network sequentially and obtained an efficient reduction in noise without loss of efficacy while preserving flexibility. Consensus-Based Sequential Training (CST) (Phan et al., 2017), a variant of REINFORCE, boosted video captioning performance and improved the diversity of the generated captions. (Wei, Mi, Hu Zhen, & Chen, 2020) presented a reinforcement-learning-based method that predicted the adaptive sliding-window size sequentially for better event exploration. A single Monte Carlo sample was introduced to approximate the gradient of the reward-based loss functions.

3.2.2. Dense or Paragraph Captioning of Videos
The methods discussed in sub-section 3.2.1.3 describe a video in a single sentence. These methods are appropriate for short videos with a single event. But as the length of a video increases, it is more likely to contain multiple events, and its semantic complexity increases. Therefore, single-sentence generation is not sufficient for multi-event videos with increased semantic complexity. Thus, the most recent works (Yu, Wang, Huang, Yang, & Xu, 2016) (Senina, et al., 2014) have shifted focus to dense captioning or paragraph generation for videos. Dense or paragraph-based captioning of videos describes each activity and interaction in the form of natural language. Temporal localization of events in videos and their description


Fig. 21. Illustration of (a) Sequence-to-sequence Based Model (b) Sequence-to-Sequence Model with Attention.
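To make the attention variant in Fig. 21(b) concrete, the sketch below computes soft temporal attention weights over frame features and the resulting context vector for one decoder step. The dot-product scorer is an illustrative simplification; the surveyed models typically learn an MLP or bilinear scorer, and the function name is not from any specific paper.

```python
# A minimal sketch of soft temporal attention over frame features: the decoder
# state scores every frame, the scores are softmax-normalized, and a weighted sum
# of the frame features forms the context vector used for the next word.
import math
from typing import List, Sequence, Tuple

def temporal_attention(decoder_state: Sequence[float],
                       frame_features: Sequence[Sequence[float]]
                       ) -> Tuple[List[float], List[float]]:
    # Alignment scores: dot product between the decoder state and each frame.
    scores = [sum(s * f for s, f in zip(decoder_state, frame))
              for frame in frame_features]
    # Softmax over frames gives the attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the frame features.
    dim = len(frame_features[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, frame_features))
               for d in range(dim)]
    return weights, context
```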

Fig. 22. The V2C-Transformer model architecture contains: (a) Video Encoder, (b) Decoder module consisting of a Caption Decoder and a Common-sense Decoder,
and (c) Transformer Decoder module (Fang, Gokhale, Banerjee, Baral, & Yang, 2020).


Fig. 23. Illustration of the difference between Sequence-to-Sequence Vision-to-Text (S2VT) and Hierarchical and Multimodal Video Captioning (HMVC), (Liu A.-A., et al., 2017).

is involved in dense captioning applications. Furthermore, the problem of overlapping events is addressed by dense captioning methods. Dense or paragraph-based captioning of videos is an emerging area in video captioning and is expected to flourish over the years, as it fosters multiple-event localization along with caption generation for each event, taking the overlap between events into account. With the introduction of large-scale datasets with dense sentence-level annotations, such problems have become feasible to solve. (Senina, et al., 2014) presented a multiple-caption generation framework that generated coherent multi-sentence descriptions of complex videos. This method included the prediction of a semantic representation from videos and the generation of natural language descriptions from that semantic representation. (Shin et al., 2016) described a technique that conveyed richer content with action localization and temporal segmentation for the generation of story-like captions. In the case of inter-sentence dependencies, the methods of (Senina, et al., 2014) and (Shin et al., 2016) are limited, and hence (Yu, Wang, Huang, Yang, & Xu, 2016) described a key framework known as the hierarchical RNN (h-RNN). It represented a long video with a paragraph and utilized two generators: the first generates single short sentences for specific time intervals and regions in videos, while the second composes paragraphs using a recurrent layer. Sentential information flows in a unidirectional way through the paragraph recurrent layer, which can mislead the information passed down when the first several sentences are generated incorrectly. (Escorcia, Heilbron, Niebles, & Ghanem, 2016) resolved this issue with the use of a bidirectional RNN. Deep Action Proposals (DAP) (Escorcia, Heilbron, Niebles, & Ghanem, 2016) utilized high-capacity deep-learning models and memory cells to retrieve temporal segments from untrimmed videos. (Krishna et al., 2017) worked on the representation of natural videos involving the detection and description of frames. This model improved the performance of dense event captioning, video retrieval, and localization. (Xiong et al., 2018) treated an entire video as a whole and generated a caption conditioned on a single embedding. This approach produced a descriptive caption by assembling temporally localized descriptions. In Fig. 25 it can be observed that important semantic events in videos are localized and coherent paragraph descriptions are generated for each semantic event. Accurate caption generation is an onerous task, and the methods proposed earlier are either trained separately or in alternation. To handle this issue, an end-to-end model (Zhou et al., 2018a) based on a transformer and a self-attention mechanism was defined. It employed a masking network to restrict its attention to the proposed event over the encoding features. Later, Zhang et al. (2020, September) presented a framework for dense video captioning, namely DaS. This model partitioned each untrimmed long video into multiple possible events and extracted C3D features for the generation of a single sentence for each segment. Further, a dense video captioning model was presented as video cue-aided sentence summarization. It consists of a novel two-stage LSTM approach equipped with a hierarchical attention mechanism that summarized all generated sentences into one descriptive sentence. Dense video captioning is a very challenging task as it requires holistic knowledge of the video contents as well as contextual reasoning about individual events. (Suin & Rajagopalan, 2020) presented a deep reinforcement technique that enabled the model to describe multiple events in a video while watching only a portion of the frames. This model improved the accuracy with a substantial reduction in computation. Most video captioning approaches failed to consider the temporal dependency between events and generated redundant and inconsistent descriptions. To overcome this, (Mun et al., 2019) proposed a model that considered temporal dependency across events in a video and leverages visual and linguistic context from prior events for coherent storytelling.
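The propose-then-describe pattern shared by the dense and paragraph captioning methods above can be sketched as follows. The sliding-window proposal generator is purely illustrative (the surveyed methods, e.g., DAP, ESGN, and PDVC, learn their proposals), and propose_events, dense_caption, and caption_segment are placeholder names, not APIs of any cited system.

```python
# A minimal sketch of dense video captioning as "propose temporal segments, then
# caption each segment"; the resulting sentences, ordered by start time, form a
# paragraph describing the untrimmed video.
from typing import Callable, List, Sequence, Tuple

def propose_events(num_frames: int, window: int = 32, stride: int = 16
                   ) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of overlapping candidate segments."""
    return [(s, min(s + window, num_frames))
            for s in range(0, max(num_frames - 1, 1), stride)]

def dense_caption(frame_features: Sequence[Sequence[float]],
                  caption_segment: Callable[[Sequence[Sequence[float]]], str]
                  ) -> List[Tuple[Tuple[int, int], str]]:
    """Caption every proposed segment with a user-supplied captioner."""
    results = []
    for start, end in propose_events(len(frame_features)):
        sentence = caption_segment(frame_features[start:end])
        results.append(((start, end), sentence))
    return results
```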

Fig. 24. Overview of the HRL framework for video captioning (Wang, Chen, Wu, Wang, & Wang, 2018).
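A minimal sketch of the reward-driven objective that the reinforcement-learning-based captioners in Table 13 build on is given below: a caption sampled from the current policy is scored with a sentence-level reward (e.g., CIDEr or the entailment-corrected CIDEnt), and the log-likelihood of the sampled tokens is weighted by the advantage over a baseline. The function and argument names are placeholders, and the choice of baseline (e.g., a greedy-decoding score) is left to the surrounding training code.

```python
# REINFORCE-style surrogate loss for one sampled caption. Minimizing it increases
# the probability of captions whose sentence-level reward exceeds the baseline and
# decreases it otherwise; sample_log_probs are the per-token log-probabilities of
# the sampled caption under the current policy.
from typing import List

def reinforce_loss(sample_log_probs: List[float], reward: float,
                   baseline: float = 0.0) -> float:
    advantage = reward - baseline
    return -advantage * sum(sample_log_probs)

# Example: a three-token sample with a CIDEr-style reward of 0.82 against a
# greedy baseline of 0.75 yields a negative loss, i.e. the sample is reinforced.
loss = reinforce_loss([-0.7, -1.2, -0.4], reward=0.82, baseline=0.75)
```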


Fig. 25. Framework for Progressive Generator of Video Paragraph Descriptions (Xiong, Dai, & Lin, 2018).

(Wang, et al., 2021) presented a framework for dense Video Captioning with Parallel Decoding (PDVC) with prediction tasks. This method effectively increased the coherence and readability of the predicted captions. (Lie, et al., 2020) presented an approach called Memory-Augmented Recurrent Transformer (MART), which used a memory module to augment the transformer architecture. The memory module helped better predict the next sentence and encouraged coherent paragraph generation.
Most of the work discussed earlier focused mainly on visual information and completely ignored the audio track. (Iashin and Rahtu, 2020) described events in videos using any number of modalities. This model showed how audio and speech modalities can improve the performance of dense video captioning techniques; it utilized Automatic Speech Recognition (ASR) in addition to a transformer-based caption encoding mechanism. A similar method (Iashin and Rahatu, 2020) showed significantly better results on visual and audio cues by utilizing a bi-modal encoder in the captioning module and proved to be an elegant approach for dense video captioning. Video paragraph captioning describes the events in untrimmed videos through event detection and event captioning, which makes the quality of the generated paragraphs highly dependent on the accuracy of event-proposal detection, a challenging task. (Song et al., 2021) presented a paragraph captioning model for untrimmed videos that spurns the problematic event-detection stage and generates paragraphs directly. As untrimmed videos contain massive but redundant frames, this model augmented the video encoder with key-frame awareness, which improved efficiency. Furthermore, (Bao, Zheng, & Mu, 2021) presented the Dense Events Propagation Network (DepNet) for generating paragraphs for untrimmed videos by effectively exploiting both the temporal order and the semantic relations of dense events. (Yamazaki, et al., 2022) presented VLCap to generate coherent paragraph descriptions for untrimmed videos. This model described both human and non-human objects, and both visual and non-visual components.

3.2.3. Discussion
The scope of video captioning has increased with the enormous improvement in technology and captioning algorithms. Template-based techniques (Kojima, Tamura, & Fukunag, 2002) (Khan, Zhang, & Gotoh, 2011) are a two-stage process, namely, visual attribute identification and natural language generation. These methods are based on the detection of SVO triplets in a video and provide rigid captions, but they address only a limited set of pre-defined actions and do not generate efficient video descriptions. For these methods, the evaluation is limited to a narrow domain with a small vocabulary. Such techniques have generated captions for only a single person in a video or can only describe the motion of a vehicle moving in traffic; they fail to deal with multi-person interactions and to generate fluent, long descriptions of videos. Further, ML-based techniques strengthened visual tracking and recognition and highlighted the issue of large vocabularies (Xu et al., 2015c). However, such techniques lacked the efficiency to work with richer visual and semantic information. Here, the main challenge lies in short-range activity detection, which lacks robust models to handle the massive computation of frame-by-frame motion calculation and fails to describe the whole scene. With the evolution of deep-learning-based techniques for SSVC, improvements in the extraction of visual and semantic information and in the generation of efficient descriptions are observed. Sequence-to-sequence models (Venugopalan, et al., 2014) (Sutskever, Vinyals, & Quoc V. Le, 2014) are based on the encoder-decoder architecture and use LSTM as the building block. These models (Venugopalan, et al., 2015) (Xu et al., 2015a) allow variable lengths of input and output. Improvements in the quality and diversity of captions can be achieved by an attention-based mechanism (Gao, Guo, Zhang, Xu, & Shen, 2017) for sequence-to-sequence learning, which considers both the local and the global temporal structure of videos to produce descriptions. Transformer-based models (Jin, Huang, Chen, Li, & Zhang, 2020) (Sur, 2020) generated more accurate descriptions of videos that are close to the ground truth and overcome the limitations of the LSTM technique. To handle complex videos, temporal hierarchical structures (Pan,


Xu, Yang, Wu, & Zhuang, 2016) and spatial hierarchical structures (Liu et al., 2017) helped to detect the occurrence of objects in a video and provide accurate descriptions for a large number of long-range dependencies in complex videos. Reinforcement learning is much more advantageous as it is not limited to a specific evaluation metric (Pasunuru & Bansal, 2017), which boosts the model performance and improves the diversity of the generated captions. It is noticed that dense or paragraph video captioning (Senina, et al., 2014) describes every activity and interaction in a video in the form of natural language. A few researchers (Iashin and Rahtu, 2020) (Iashin and Rahatu, 2020) are trying to include audio and speech modalities with videos to generate dense descriptions, and it has certainly improved the performance.

4. Datasets and Performance Evaluation Parameters

This section presents a review of popular datasets and evaluation metrics for image and video captioning, summarized as follows:

i) In sub-section 4.1, popularly used datasets for visual captioning are discussed. The three most popular benchmark datasets, Flickr8K, Flickr30K, and MSCOCO, along with various other datasets for image captioning, are reported in sub-section 4.1.1, whereas sub-section 4.1.2 briefs about popular datasets for video captioning, which are categorized into different classes, namely cooking, movies, social media, etc. A summary of the characteristics of the datasets used for image captioning and video captioning is listed in Tables 15 and 16, respectively.
ii) Sub-section 4.2 highlights the most commonly used performance evaluation metrics, namely BLEU (B-@N), METEOR (M), ROUGE (R), SPICE (S), WMD, CIDEr (C), UMIC, etc., which are used to measure the quality of generated captions compared to the ground truth.
iii) In sub-section 4.3, the comparison of benchmark datasets introduced from 2014 to the present is outlined by highlighting the performances of the state-of-the-art in terms of the performance evaluation metrics (discussed in sub-section 4.2). This section also presents a summary of deep-learning-based methods' performances for image captioning and video captioning, with their datasets and evaluation metrics, in Tables 17 and 18. This kind of analysis helps identify the recent best-performing state-of-the-art to begin experiments with and provides better clarity about the techniques.

4.1. Datasets

The availability of labeled datasets for image and video captioning has been the main driving force behind the fast advancement in the area of visual captioning. There are a number of datasets available for the training, testing, and evaluation of image and video captioning problems. In this section, different types of datasets from different domains for visual captioning are discussed.

4.1.1. Datasets for Image Captioning
Datasets for image captioning differ in various respects, such as the number of images in the dataset, the number of captions per image, whether captions provide a personal, cultural, or historical description of the image, the format of the images and captions, and the size of the images. Datasets for image captioning are briefly discussed as under:

i) MSCOCO Dataset: MSCOCO (Lin, et al., 2014) is the most widely used dataset for image recognition, segmentation, and image captioning. This dataset contains more than 3,00,000 images and 2 million instances with 80 object categories. The dataset includes 5 captions per image.
ii) Flickr30k Dataset: Flickr30K (Young, Lai, Hodosh, & Hockenmaier, 2014) is a dataset used for automatic image captioning. It contains 30 K images collected from Flickr, with 150 K captions annotated by humans. For this dataset, researchers can choose their own split of images for training, testing, and validation purposes.

Table 15
Overview of Popular Image Captioning Datasets.
Ref. Dataset Source #images #train #val #test #captions objects GitHub Link
images images images

(Lin, et al., 2014) MSCOCO Images of 91 object types 328 K 83 K 118 K 41 K 5 Partial http://mscoco.org/

(Young, Lai, Hodosh, Flickr30K Contains people involved 30 K – – – 5 No https://nlp.cs.illinois.


& Hockenmaier, in everyday activities and edu/denotation.html
2014) events
(Hodosh, Young, & Flickr8K Contains “action” images of 8K 6K 1K 1K 5 No –
Hockenmaier, scenes featuring people and
2013) animals.
(Krishna, et al., 2017) Visual Genome Images from MSCOCO 108 K 80% 10% 10% – – https://visualgenome.
org/

(Kazemzadeh, ReferItGame Photographs of natural 19,894 – – – – http://referitgame.com/


Ordonez, Matten, & scenes
Berg, 2014)
(Sharma, Ding, Conceptual extracting and filtering 3.3 M – – – 1 – https://github.com/
Goodman, & Captions image caption annotations google-research-datasets/
Soricut, 2018) from billions of web pages. conceptual-captions

(Mathews, Xie, & He, SentiCap Several – – – 2000+ – https://users.cecs.anu.


2016) edu.au/%E2%88 %
BCu4534172/senticap.
html
(Zitnick, Parikh, & Abstract Scenes abstract images created 50 K 80% – 20% 6 Complete
Vanderwende, from collections of clipart
2013)
Instagram Instagram images mostly of 1.1 M 90% 105,000 5K 1 – –
Datasets celebrities.
(Gan, Gan, He, & Gao, Flickrstyle10K Flickr images 10 K 7K 2K 1K – – –
2017)
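Most of the image-captioning datasets in Table 15 ship their labels as a JSON file that pairs image identifiers with several reference captions. The sketch below groups captions per image for a COCO-style annotation file; the "images"/"annotations" keys follow the public MSCOCO captions format, and the file path in the final comment is only an example, not a prescribed location.

```python
# Group reference captions per image from a COCO-style captions JSON file.
import json
from collections import defaultdict
from typing import Dict, List

def load_captions(annotation_path: str) -> Dict[str, List[str]]:
    """Return {file_name: [caption, ...]} for a COCO-style annotation file."""
    with open(annotation_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    id_to_name = {img["id"]: img["file_name"] for img in data["images"]}
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ann in data["annotations"]:
        grouped[id_to_name[ann["image_id"]]].append(ann["caption"].strip())
    return dict(grouped)

# captions = load_captions("annotations/captions_train2014.json")  # example path
```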


Table 16
Overview of Popular datasets for Video Captioning.
Ref. Dataset Source #classes #videos avg_len #clips Len #words Github Link/ Homepage
(sec) (hrs) Link

(Rohrbach, Amin, Andriluka, & MP-II Contains preparation 65 44 600 – 8 – –


Schiele, 2012) videos of 14 dishes by
12 participants
(Das, Xu, Doell, & Corso, 2013) YouCook YouTube videos of 6 2K 316 15.4 K 2.3 2711 http://www.cse.buffalo.
different people with edu/~jcorso/r/youcook
different recipes
(Zhou, Xu, & Corso, Towards YouCook-II YouTube videos offer 89 2K 316 15.4 K 176 2600 http://youcook2.eecs.
automatic learning of all challenges of open- umich.edu/
procedures from web domain videos.
instructional videos, 2018)
(Rohrbach, et al., 2012) TACoS Subset of MP-II 26 127 360 7206 15.8 28,292 http://tacodataset.org/

(Rohrbach, Rohrbach, Tandon, MP-II MD Hollywood Movies – 94 3.9 68,337 73.6 24,549 –
& Schiel, 2015)
(Torabi, Pal, Larochelle, & M− VAD DVS – 92 6.2 48,986 84.6 17,609 –
Courville, 2015)
(Gella, Lewis, & Rohrbach, VideoStory Social Media Videos – 20 K 123 K 396 – –
2018)
(Krishna, Hata, Ren, Fei-Fei, & ActivityNet Open-source videos – 20 K 180 – 849 – http://cs.stanford.edu/
Niebles, 2017) people/ranjaykrishna/
densevid/

(Zhou, Kalantidis, Chen, Corso, ActivityNet Social Media – 14,281 180 52 K – 607,339 https://github.com/
& Rohrbach, 2018) Entities facebookresearch/
ActivityNet-Entities

(Chen & Dolan, 2011) MSVD Movie Videos 218 1970 10 1970 5.3 – –

(Xu J., Mei, Yao, & Rui, 2016) MSR-VTT Commercial Video 20 7180 20 10 K 41.2 29,316 –
search engine
(Sigurdsson, et al., 2016) Charades Indoor home activities 157 9848 30 – 82.01 – http://allenai.org/plato/
videos charades/

(Zeng, Chen, Niebles, & Sun, VTW YouTube – 18,100 90 – 213.2 – http://aliensunmin.github.
2016) io/project/video-
language/

(Miech, et al., 2019) HowTo100M YouTube 12 1.2 M – 136 M – – https://www.di.ens.fr/


willow/research/
howto100m

(Wang, et al., 2020) VATEX Open-source videos 600 41.3 K 41.3 K – – – https://eric-xw.github.io/
vatex-website/index.html
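After preprocessing, the video-captioning corpora in Table 16 reduce to a mapping from a clip identifier to several reference sentences plus a word vocabulary. The dataset-agnostic sketch below illustrates that common preprocessing step; the minimum-frequency threshold of 3 is an arbitrary illustrative choice, not a value prescribed by any of these datasets.

```python
# Build a word vocabulary from multi-reference video captions, mapping rare words
# to <unk>; clip_captions is {clip_id: [reference caption, ...]}.
from collections import Counter
from typing import Dict, List, Tuple

def build_vocab(clip_captions: Dict[str, List[str]], min_freq: int = 3
                ) -> Tuple[Dict[str, int], Counter]:
    counts: Counter = Counter()
    for captions in clip_captions.values():
        for caption in captions:
            counts.update(caption.lower().split())
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3}
    for word, freq in counts.most_common():
        if freq >= min_freq and word not in vocab:
            vocab[word] = len(vocab)
    return vocab, counts
```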

iii) Flickr8K Dataset: Flickr8K (Hodosh, Young, & Hockenmaier, vii) SentiCap Dataset: SentiCap (Mathews et al., 2016) contains
2013) is a collection of 8 K images that are collected from Flickr. several thousands of images with captions containing emotions.
Test and development data for this dataset includes 1000 images The captions for images are distributed among two categories of
each while the training set contains 6 K images. Each image in the positive and negative sentiments which are constructed by au­
dataset contains 5 captions provided by human annotators. thors by re-writing factual descriptions.
iv) Visual Genome Dataset: Visual Genome dataset (Krishna, et al., viii) Abstract Scenes Dataset: Abstract Scenes dataset (Zitnick, Par­
2017) is another dataset used in image captioning. This dataset ikh, & Vanderwende, 2013) contains 10,000 clip-art images with
has separate captions for multiple regions in an image. The their descriptions. The description of clip-art images is provided
dataset has seven fundamental elements: location descriptions, in two groups, the first group contains a single sentence
gadgets, attributes, relationships, location graphs, scene graphs, description for a given image, while the second group contains
and question–answer pairs. The dataset has greater than 108 two alternative descriptions for a single image.
K images. Every image carries an average of 35 items, 26 attri­ ix) Instagram Dataset: Tran et al. (Tran, et al., 2016) and Park et al.
butes, and 21 pair-wise relationships between items. (Park, Kim, & Kim, 2017) created two image datasets from the
v) ReferItGame Dataset: ReferItGame (Kazemzadeh et al., 2014) images from the social-networking service Instagram. (Tran,
allows us to study various expressions in real-world scenes and et al., 2016) contains mostly images of celebrities and has about
allows us to perform experimental evaluations on various models. 10 K images whereas (Park, Kim, & Kim, 2017) contains about
This dataset contains 130,525 expressions which refer to 96,654 1.1 M images from about 6.3 K users.
distinct objects which are extracted from 19,894 images. x) FlickrStyle10KDataset: The dataset (Gan et al., 2017) has
vi) Conceptual Captions Dataset: The dataset (Sharma, Ding, Flickr images with stylized captions. The training data conta
Goodman, & Soricut, 2018) contains images of the order of ins 7,000 images. The validation and test data consist of 2,000
magnitude more images than the MSCOCO dataset. This dataset and 1,000 images, respectively. Each photo contains romantic,
represents a wide variety of images and caption styles that are humorous, and factual captions.
extracted from billions of web pages.


Table 17
An overview of Recent Methods, Datasets, and Evaluation Metrics for Image Captioning.
Dataset Ref Method Evaluation Metrics

MSCOCO (Karpathy & Fei-Fei, Deep Visual-Semantic Alignments for Generating Image CNN + bi-RNN BLEU, METEOR, CIDEr
Descriptions, 2015)
(Chen & Zitnick, 2015) LSTM + Bi-RNN BLEU, METEOR, CIDEr
(Vinyals, Toshev, Bengio, & Erhan, 2015) LSTM (CNN-RNN) BLEU, METEOR, CIDEr
(Jia, Gavves, Fernando, & Tuytelaars, 2015) LSTM + g-LSTM BLEU, METEOR, CIDEr
CNN-LSTM BLEU, METEOR
(Hendricks, et al., 2016) DCC F-1, BLEU, METEOR
(Yao, Pan, Li, Qiu, & Mei, Boosting image captioning with attributes, 2017) LSTM-A BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(You, Jin, Wang, Fang, & Luo, 2016) CNN-RNN + Attention BLEU, METEOR, CIDEr,
ROUGE-L
(Xu, et al., 2015) CNN-LSTM BLEU, METEOR
(Wu, Shen, Wang, Dick, & Hengel, 2018) CNN-RNN BLEU, METEOR
(Wang, Yang, Bartz, & Meinel, 2016) LSTM + bi-LSTM BLEU, R@K
(Donahue, et al., 2015) CNN-LSTM + LRCN BLEU, METEOR, CIDEr,
ROUGE-L
(Pu, et al., 2016) CNN + DGDN BLEU, METEOR, CIDEr
(Wu, Shen, Liu, Dick, & Hengel, 2016) CNN-RNN BLEU, METEOR, CIDEr
(Krause, Johnson, Krishna, & Li, 2016) HRN BLEU, METEOR, CIDEr
(Yao, Pan, Li, & Mei, Incorporating copying mechanism in image captioning for LSTM-C F-1, METEOR
learning novel objects, 2017)
(Venugopalan, et al., 2016) NOC F-1, METEOR
(Tavakoli, Shetty, Borji, & Laaksonen, 2017) Deep CNN + LSTM BLEU, METEOR, CIDEr,
ROUGE-L
(Chen L., et al., 2017) SCA-CNN BLEU, METEOR, CIDEr,
ROUGE-L
(Liu, MAo, Sha, & Yuille, 2017) CNN-LSTM BLEU, METEOR
Lu et al. (Lu, Xiong, Parikh, & Socher, 2017) LSTM + Spatial Attention BLEU, METEOR, CIDEr
Pedersoli et al. (Pedersoli, Lucas, Schmid, & Verbeek, 2017) CNN-RNN BLEU, METEOR, CIDEr
(Cheng, et al., 2017) CNN + RNN BLEU, METEOR
Liu et al. (Liu, Sun, Wang, Wang, & Yuille, 2017) CNN (LSTM) + RNN (LSTM) BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Zha, Liu, Zhang, Zhang, & Wu, 2022) CAVP BLEU, METEOR, CIDEr,
ROUGE-L
(Wu, Zhu, Jiang, & Yang, 2018) DNOC F-1, METEOR
(Chen C.-K., Pan, Sun, & Liu, 2018) LSTM with Domain Layer Norm BLEU, METEOR, CIDEr
(Gao, Wang, & Wang, 2018) CNN-RNN-SVM BLEU, METEOR, CIDEr
(Anderson, et al., 2018) R-CNN BLEU, METEOR, CIDEr,
Soft Attention ROUGE-L, SPICE
(Tan, Feng, & Ordonez, 2019) CNN-RNN BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Chen T., et al., 2018) Styled Factual-LSTM BLEU, METEOR, CIDEr
(Li, Yao, Pan, Chao, & Mei, 2019) LSTM-P F-1, SPICE, CIDEr, METEOR
(Agrawal, et al., 2019) Nocaps SPICE, CIDEr, METEOR, BLEU
(Feng, et al., 2019) Cascaded Revision Network F-1, METEOR
(Zhang, et al., 2020) LSTM and EE-LSTM BLEU, METEOR, CIDEr,
ROUGE-L
(Huang, Wang, Chen, & Wei, 2019) AoANet
(Zhao, Chang, & Guo, 2019) CNN-RNN (GRU, LSTM) BLEU, METEOR, CIDEr,
ROUGE-L
(Xian & Tian, 2019) Multimodal LSTM BLEU, METEOR, CIDEr,
ROUGE-L
(Nikolaus, Abdou, Lamm, Aralikatte, & Elliott, 2019) LSTM Embedding + Attention BLEU, METEOR, CIDEr,
RECALL
(Chen, et al., 2019) CNN + RNN with GAN BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Guo, Liu, Yao, Li, & Lu, 2019) MSCap BLEU, METEOR, CIDEr
(Xiao, Wang, Ding, Xiang, & Pan, 2019) DSE-LSTMS BLEU, METEOR, CIDEr
(Cornia, Stefanini, Baraldi, & Cucchiara, 2020) M2 -Transformer BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Shi, Zhou, Qiu, & Zhu, 2020) CGVRG BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Deng, Jiang, Lan, Huang, & Luo, 2020) LSTM + Adaptive Attention BLEU, METEOR
(Tian & Oh, 2020) LSTM + Attention BLEU, SPICE, CIDEr, ROUGE-L,
(Hu, et al., 2021) VIVO CIDEr, SPICE
(Tripathi, Nguyen, Guha, Du, & Nguyen, 2021) SG2Caps (GCN-LSTM) BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Yan, et al., 2021) Adaptive Attention + Vanilla BLEU, METEOR, CIDEr,
Transformer ROUGE-L, SPICE
(Zhang, Wu, Wang, & Chen, 2021) Parallel Attention Mechanism BLEU, METEOR, CIDEr,
ROUGE-L
(Mishra, Dhir, Saha, Bhattacharyya, & Singh, 2021) CNN + RNN BLEU, METEOR, CIDEr,
ROUGE-L
(Xiao, Xue, Shen, & Gao, 2022) Att-LSTM



BLEU, METEOR, CIDEr,


ROUGE-L, SPICE
(Liu, et al., 2022) CIIC BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Zeng, Zhang, Song, & Gao, 2022) S2 Transformer BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Jia, Wang, Peng, & Chen, 2022) SAET BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Herdade, Kappeler, Boakye, & Soares, 2020) Object Relation Transformer BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
Stanford Paragraph (Li, Liang, Shi, Feng, & Wang, 2020) Dual-CNN BLEU, METEOR, CIDEr
Dataset (Zha, Liu, Zhang, Zhang, & Wu, 2022) CAVP BLEU, METEOR, CIDEr,
ROUGE-L
(Shi, et al., 2021) BLEU, METEOR, CIDEr
(Guo, Lu, Chen, & Zeng, 2021) VTCM-Transformer BLEU, METEOR, CIDEr
(Johnson, Karpathy, & Fei-Fei., Densecap: Fully convolutional localization CNN-RNN AP, METEOR
networks for dense captioning, 2016)
(Yang, Tang, Yang, & Li, 2016) LSTM AP, METEOR
Visual Genome (Wang, Luo, Li, Huang, & Yin, 2018) CNN-LSTM + Attention BLEU, METEOR, CIDEr
(Kyriazi, Han, & Rush, 2018) SCST + Repetition Penalty BLEU, METEOR, CIDEr
(Zhang, et al., 2019) LSTM mAP
(Shi, Zhou, Qiu, & Zhu, 2020) (CGVRG) BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Kim, Oh, Choi, & Kweon, 2020) MTTSNet mAP, METEOR
(Tripathi, Nguyen, Guha, Du, & Nguyen, 2021) SG2Caps (GCN-LSTM) BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Mao, et al., 2015) m-RNN BLEU, R@4, mrank
(Chen & Zitnick, 2015) LSTM + Bi-RNN BLEU, METEOR, CIDEr
(Vinyals, Toshev, Bengio, & Erhan, 2015) LSTM (CNN-RNN) BLEU, METEOR, CIDEr
(Wu, Shen, Wang, Dick, & Hengel, 2018) CNN-RNN BLEU, METEOR
(You, Jin, Wang, Fang, & Luo, 2016) CNN-RNN + Attention BLEU, METEOR, CIDEr,
ROUGE-L
(Xu, et al., 2015) CNN-LSTM BLEU, METEOR
(Wang, Yang, Bartz, & Meinel, 2016) LSTM + bi-LSTM BLEU, R@K
Flickr30K (Pu, et al., 2016) CNN + DGDN BLEU, METEOR, CIDEr
(Wu, Shen, Liu, Dick, & Hengel, 2016) CNN-RNN BLEU, METEOR, CIDEr
(Donahue, et al., 2015) CNN-LSTM + LRCN BLEU, METEOR, CIDEr,
ROUGE-L
(Chen L., et al., 2017) SCA-CNN BLEU, METEOR, CIDEr,
ROUGE-L
(Liu, MAo, Sha, & Yuille, 2017) CNN-LSTM BLEU, METEOR
(Lu, Xiong, Parikh, & Socher, 2017) LSTM + Spatial Attention BLEU, METEOR, CIDEr
(Cheng, et al., 2017) CNN + RNN BLEU, METEOR
(Xiao, Wang, Ding, Xiang, & Pan, 2019) DSE-LSTMS BLEU, METEOR, CIDEr
(Zhang, et al., 2020) LSTM and EE-LSTM BLEU
(Zhao, Chang, & Guo, 2019) CNN-RNN (GRU, LSTM) BLEU, METEOR, CIDEr,
ROUGE-L
(Xiao, Xue, Shen, & Gao, 2022) Att-LSTM BLEU, METEOR, CIDEr,
ROUGE-L, SPICE
(Deng, Jiang, Lan, Huang, & Luo, 2020) LSTM + Adaptive Attention BLEU, METEOR
(Liu & Xu, Adaptive Attention-based High-level Semantic Introduction for CNN + LSTM + Attention BLEU, METEOR, CIDEr
Image Caption, 2020)
FlickerStyle10K/8K (Gan, Gan, He, & Gao, 2017) StyleNet BLEU, METEOR, CIDEr
(Wu, Zhao, & Luo, 2022) BLEU, METEOR, CIDEr
(Chen T., et al., 2018) Styled Factual-LSTM BLEU, METEOR, CIDEr
(Guo, Liu, Yao, Li, & Lu, 2019) MSCap BLEU, METEOR, CIDEr
(Zhao, Wu, & Zhang, 2020) MemCap BLEU, METEOR, CIDEr
(Chen T., et al., 2018) Styled Factual-LSTM BLEU, METEOR, CIDEr
SentiCap (Guo, Liu, Yao, Li, & Lu, 2019) MSCap BLEU, METEOR, CIDEr
(Wu, Zhao, & Luo, 2022) BLEU, METEOR, CIDEr
(Zhao, Wu, & Zhang, 2020) MemCap BLEU, METEOR, CIDEr
(Mao, et al., 2015) m-RNN BLEU, R@K, mrank
(Chen & Zitnick, 2015) LSTM + Bi-RNN BLEU, METEOR, CIDEr
(Karpathy & Fei-Fei, Deep Visual-Semantic Alignments for Generating Image CNN + bi-RNN BLEU, METEOR, CIDEr
Descriptions, 2015)
(Jia, Gavves, Fernando, & Tuytelaars, 2015) LSTM + g-LSTM BLEU, METEOR, CIDEr
(Xu, et al., 2015) CNN-LSTM BLEU, METEOR
(Wu, Shen, Wang, Dick, & Hengel, 2018) CNN-RNN BLEU, METEOR
(Wang, Yang, Bartz, & Meinel, 2016) LSTM + bi-LSTM BLEU, R@K
Flickr8K (Pu, et al., 2016) CNN + DGDN BLEU, METEOR, CIDEr
(Wu, Shen, Liu, Dick, & Hengel, 2016) CNN-RNN BLEU, METEOR, CIDEr
(Chen L., et al., 2017) SCA-CNN BLEU, METEOR, CIDEr,
ROUGE-L
(Zhang, et al., 2020) LSTM and EE-LSTM BLEU
nocaps (Agrawal, et al., 2019) Nocaps BLEU, METEOR, CIDEr,
ROUGE-L, SPICE



(Hu, et al., 2021) VIVO BLEU, METEOR, CIDEr,


ROUGE-L, SPICE
(Cao, et al., 2020) METEOR, CIDEr, ROUGE-L
(Wu, Jiang, & Yang, Switchable Novel Object Captioner, 2023) S-NOC METEOR
Spot-the-dif (Park, Darrell, & Rohrbach, 2019) DUDA BLEU, METEOR, CIDEr, SPICE
(Kim, Kim, Lee, Park, & Kim, 2021) VACC BLEU, METEOR, CIDEr, SPICE
(Shi, Yang, Gu, Joty, & Cai, 2020) M− VAM BLEU, METEOR, CIDEr, SPICE

xi) Spot-the-diff: Spot-the-diff (Jhamtani and Kirkpatrick, 2018) each video. The dataset comprises 11,796 sentences that contain
dataset is the most popular dataset used for change image 17,334 action descriptions.
captioning tasks. This dataset contains 13,192 image pairs along v) MPII-MD dataset: MPII-Movie Description (Rohrbach, Rohrbach,
with corresponding human-provided text annotations stating the Tandon, & Schiel, 2015) Corpus contains transcribed audio de­
differences between the two images. scriptions extracted from 94 Hollywood movies. The audio de­
xii) CLEVR: CLEVR (Johnson, et al., 2016) is a synthetic visual scriptions track is an added feature in the dataset trying to
question-answering dataset. It contains images of 3D-rendered describe the visual content to help visually impaired persons.
objects. This dataset includes a training set with 70 k images These movies are subdivided into 68,337 clips with an average
and 700 k questions, a validation set with 15 k images and 150 k length of 3.9 s combined with 68,375 sentences. Every clip is
questions, a test set with 15 k images and 150 k object-related combined with a sentence that is extracted from the script of that
questions, as well as answers, scene graphs, and functional pro­ movie and the audio description data. The total period of the
grams for all the training and validation sets of images and dataset videos is almost 73.6 h with a vocabulary size of 653,467.
questions. vi) M− VAD Dataset: Montreal Video Annotation Dataset (M− VAD)
(Torabi et al., 2015) is based on the Descriptive Video Service
4.1.2. Datasets for Video Captioning (DVS). It consists of video clips from 92 different movies. A total
The availability of labeled datasets for the description of videos is the of 48,986 video clips are there with an entire time of 84.6 h. Each
main driving force behind the fast advancement of this research area. At clip is over 6.2 s on average and the total number of sentences is
the beginning of video data collection, the datasets were categorized 55,904, where some clips are associated with more than one
into cooking videos, makeup videos, movies, social media, etc. in most sentence. The vocabulary of the dataset spans about 17,609. The
of the datasets. Each video consists of a single-sentence description of dataset is split into training, testing, and validation as 38,949,
the video except for a few datasets which contain multiple sentences 5,149, and 4,888 video clips.
describing a particular video or a paragraph per video snippet. Video vii) VideoStory Dataset: Video Story (Gella, Lewis, & Rohrbach,
captioning datasets are briefly discussed below: 2018) comprises 20 K social media videos with multi-sentence
descriptions for each video. The dataset intends to address the
i) MP-II Cooking Dataset: Max Plank Institute for Informatics (MP- story narration or description of long videos that could not be
II) Cooking dataset (Rohrbach, Amin, Andriluka, & Schiele, illustrated with a single sentence, with at least one paragraph per
2012) contains videos of food preparation by 12 participants for video. The average number of temporally localized sentences per
14 dishes with 65 fine-grained cooking activities. The cooking paragraph is 4.67. The total number of 26,245 paragraphs in the
activities included in videos are “washing hands“, ”put in the dataset comprises 123 K sentences with an average of 13.32
bowl“, ”cut apart“, ”take out from drawer“ etc. MP-II dataset also words per sentence. The dataset has a training, test, and valida­
comprises 44 videos with 888,775 frames and spans a total of 8 h tion split of 17908, 1011, and 999 videos respectively. This
of play length with an average length per clip of approximately dataset also proposes a blind test set that comprises 1039 videos.
600 s. viii) ActivityNet dataset: ActivityNet Captions dataset (Krishna et al.,
ii) YouCook Dataset: The YouCook dataset (Das, Xu, Doell, & Corso, 2017) contains 20 K videos with 100 K dense natural language
2013) incorporated 88 YouTube videos of different people with descriptions. The total time for this dataset is about 849 h. On
different recipes. The background kitchen scenes are different in average, each description is composed of 13.48 words and covers
the maximum videos. The training set contains 49 videos and the about 36 s of video. There are multiple descriptions for every
test set consists of 39 videos. Amazon Mechanical Turk (AMT) video and when combined, these descriptions cover 94.6% of the
was employed for human-generated multiple natural language content present in the entire video.
descriptions of each video. ix) ActivityNet Entities: ActivityNet Entities dataset (Zhou et al.,
iii) YouCook-II Dataset: YouCook-II (Zhou et al., 2018c) consists of 2018b) is the first video dataset with entities grounding and an­
2000 YouTube videos distributed over 89 recipes. This dataset notations. This dataset is built on the training and validation
offers all challenges of open-domain videos like variations in splits of the ActivityNet Captions dataset (Krishna et al., 2017),
camera position, camera motion, and changing backgrounds. The but with different captions. The dataset comprises 14,281 anno­
entire dataset spans a total playtime of 175.6 hrs and has a 2600 tated videos, 52 K video segments with at least one noun phrase
words vocabulary. The average length of each video is 316 s with annotated per segment, and 158 K bounding boxes with anno­
600 s being the maximum. The dataset is randomly split into the tations. The dataset employs a training set of 10 K videos whereas
train, validation, and test sets with a ratio of 66%:23%:10% the validation set and testing set are split into 2.5 K and 2.5 K
respectively. videos.
iv) TACoS: Textually Annotated Cooking Scenes (TACoS) (Rohr­ x) MSVD Dataset: Microsoft Video Description (MSVD) (Chen &
bach, et al., 2012) is a subset of MP-II Composites with 212 high- Dolan, 2011)contains 1,970 YouTube clips with human-
resolution videos of 41 different cooking activities. This dataset annotated sentences. This dataset was annotated by AMT
contains those activities that involved the manipulation of workers. The duration of each video is typically between 10 and
cooking ingredients and has at least 4 videos for the same activ­ 25 s with only one activity. The dataset comprises human-
ity. AMT workers were engaged to align the sentences and asso­ generated descriptions in languages like Chinese, English,
ciated videos. 20 different textual descriptions were grouped for


Table 18
An overview of Recent Methods, Datasets, and Evaluation Metrics for Video Captioning.
Dataset Ref Method Evaluation Metrics

MSVD (Venugopalan, Anne, Mooney, & Saenko, 2016) S2VT (LSTM) METEOR, BLEU
(Xu, Xiong, Chen, & Corso, 2015) Deep Video Model and Joint Embed mRank
(Nian, et al., 2017) VRM (LSTM) BLEU, CIDEr, METEOR
(Yao, et al., 2015) CNN-RNN + Temporal Attention BLEU, CIDEr, METEOR
(Pan, Mei, Yao, Li, & Rui, 2016) LSTM-E METEOR, BLEU
(Pan, Yao, Li, & Mei, 2016) LSTM-TSA BLEU, METEOR, CIDEr
(Xu, Venugopalan, Ramanis, Rohrbach, & Saenko, 2015) CNN-FCN + LSTM BLEU, METEOR
(Jin & Liang, 2016) LSTM-RNN & CNN METEOR
(Venugopalan, et al., 2015) LSTM METEOR
(Pan, Xu, Yang, Wu, & Zhuang, 2016) HRNE BLEU, METEOR
(Gao, Guo, Zhang, Xu, & Shen, 2017) LSTM + Attention BLEU, METEOR
(Liu A.-A., et al., 2017) HVMC BLEU, METEOR, CIDEr,
ROUGE-L
(Bin, et al., 2018) Bi-LSTM METEOR
(Long, Gan, & Melo, 2016) LSTM + Multi-faceted Attention BLEU, METEOR, CIDEr
(Yang, et al., 2018) LSTM-GAN
(Zhao, Li, & Lu, 2018) biLSTM + Attention BLEU, METEOR, CIDEr
(Chen, Wang, Zhang, & Huang, 2018) PickNet BLEU, ROUGE-L, METEOR,
CIDEr
(Mehri & Sigal, 2018) Middle-out decoder BLEU, METEOR, CIDEr,
ROUGE-L
(Chen J., et al., Temporal Deformable Convolutional Encoder-Decoder Networks for Video TDConvED METEOR, CIDEr, BLEU
Captioning, 2019)
(Wei, Mi, Hu Zhen, & Chen, 2020) LSTM + sliding window BLEU, METEOR, CIDEr
(Wu & Han, 2020) LSTM + MemNet BLEU, METEOR, CIDER
(Jin, Huang, Chen, Li, & Zhang, 2020) SBAT BLEU, METEOR, CIDEr,
ROUGE-L
MSR-VTT (Gao, Guo, Zhang, Xu, & Shen, 2017) LSTM + Attention BLEU, METEOR, CIDEr
(Liu A.-A., et al., 2017) HVMC BLEU, METEOR, CIDEr,
ROUGE-L
(Pasunuru & Bansal, 2017) CIDEnt BLEU, METEOR, CIDEr,
ROUGE-L
CIDEnt
(Phan, Henter, Miyao, & Satoh, 2017) BLEU, METEOR, CIDEr,
ROUGE-L
(Bin, et al., 2018) Bi-LSTM METEOR
(Wang, Chen, Wu, Wang, & Wang, 2018) HRL BLEU, METEOR, CIDEr,
ROUGE-L
(Long, Gan, & Melo, 2016) LSTM + Multi-faceted Attention BLEU, METEOR
(Yang, et al., 2018) LSTM-GAN BLEU, METEOR
(Chen, Wang, Zhang, & Huang, 2018) PickNet BLEU, ROUGE-L, METEOR,
CIDEr
(Chen J., et al., Temporal Deformable Convolutional Encoder-Decoder Networks for Video TDConvED METEOR, CIDEr, BLEU
Captioning, 2019)
(Wei, Mi, Hu Zhen, & Chen, 2020) LSTM sliding window BLEU, METEOR, CIDEr
(Wu & Han, 2020) LSTM + MemNet BLEU, METEOR, CIDER
(Jin, Huang, Chen, Li, & Zhang, 2020) SBAT BLEU, METEOR, CIDEr,
ROUGE-L
(Fang, Gokhale, Banerjee, Baral, & Yang, 2020) V2C-Transformer BLEU, METEOR, ROUGE-L
(Luo, et al., 2020) UniVL Recall
(Zhu & Yang, 2020) ActBERT with TNT Recall
MPII-MD (Venugopalan, Anne, Mooney, & Saenko, 2016) S2VT (LSTM) METEOR
(Pan, Mei, Yao, Li, & Rui, 2016) LSTM-E METEOR
(Venugopalan, et al., 2015) LSTM METEOR
(Pan, Yao, Li, & Mei, 2016) LSTM-TSA METEOR
(Shin, Ohnishi, & Harada, 2016) CNN-RNN BLEU, METEOR, CIDEr
(Nian, et al., 2017) VRM (LSTM) METEOR
YouCook II (Luo, et al., 2020) UniVL BLEU, METEOR, CIDEr,
ROUGE-L
(Sun, Baradel, Murphy, & Schmid, 2019) VideoBERT BLEU, METEOR, CIDEr,
ROUGE-L
(Zhu & Yang, 2020) ActBERT with TNT BLEU, METEOR, CIDEr,
ROUGE-L
(Sur, 2020) SACT BLEU, METEOR,
(Zhou, Zhou, Corso, Socher, & Xiong, 2018) CNN + Masked Transformer BLEU, METEOR
Charades (Zhao, Li, & Lu, 2018) biLSTM + Attention BLEU, METEOR, CIDEr,
ROUGE-L
(Wei, Mi, Hu Zhen, & Chen, 2020) LSTM model in a sliding window BLEU, METEOR, CIDEr
manner
(Wang, Chen, Wu, Wang, & Wang, 2018) HRL BLEU, METEOR, CIDEr,
ROUGE-L
(Song, Chen, & Jin, 2021) BLEU, METEOR, CIDEr
MVAD (Venugopalan, Anne, Mooney, & Saenko, 2016) S2VT (LSTM) METEOR
(Nian, et al., 2017) VRM (LSTM) METEOR



(Venugopalan, et al., 2015) LSTM METEOR


(Pan, Yao, Li, & Mei, 2016) LSTM-TSA METEOR
(Yang, et al., 2018) LSTM-GAN METEOR
(Pan, Xu, Yang, Wu, & Zhuang, 2016) HRNE (LSTM-decoder) BLEU, METEOR
(Pan, Mei, Yao, Li, & Rui, 2016) LSTM-E METEOR
ActivityNet (Sur, 2020) SACT BLEU, METEOR,
(Sun, Myers, Vondrick, Murphy, & Schmid, 2019) CBT (Extended BERT) BLEU, METEOR, CIDEr,
ROUGE-L
(Zhang, Xu, Ouyang, & Tan, 2020) DaS with LSTM BLEU, METEOR, CIDEr,
ROUGE-L
(Suin & Rajagopalan, 2020) CNN + LSTM + Self Attention BLEU, METEOR, CIDEr
(Iashin & Rahtu, Multi-modal Dense Video Captioning, 2020) MDVC BLEU, METEOR
(Iashin & Rahatu, A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Bi-modal Transformer BLEU, METEOR
Transformer, 2020)
(Song, Chen, & Jin, 2021) PCUV BLEU, METEOR, CIDEr
(Xiong, Dai, & Lin, 2018) LSTM net BLEU, METEOR, CIDEr,
ROUGE-L
(Zhou, Zhou, Corso, Socher, & Xiong, 2018) CNN + Masked Transformer METEOR
VATEX (Shi, et al., 2022) VTAR BLEU, METEOR, CIDEr,
ROUGE
(Gao, et al., 2022) D 2 -Transformer BLEU, METEOR, CIDEr,
ROUGE
(Lin, et al., 2022) SWINBERT BLEU, METEOR, CIDEr,
ROUGE
(Zhang, et al., 2020) ORG-TRL BLEU, METEOR, CIDEr,
ROUGE
(Chen J., et al., Retrieval Augmented Convolutional Encoder-Decoder Networks for Video R-ConvED BLUE, CIDEr, METEOR
Captioning, 2022)

German, etc. On average, there are 41 single-sentence de­ are the main distinctive characteristics of VATEX. First, it con­
scriptions per clip. tains descriptions in both English and Chinese which can assist
xi) MSR-VTT: MSR-Video to Text (MSR-VTT) Xu et al., 2016is one of several bilingual investigations limited by datasets and are only
the largest VC datasets with 7180 videos sub-divided into 10,000 available in one language. Second, VATEX contains the most clip-
clips. It contains a wide collection of open-domain videos from 20 sentence pairs of any corpus, with each video clip annotated with
different categories. These clips are further split into 6513 numerous distinct sentences, each caption being distinct. Third,
training, 2990 test, and 497 validation videos. Each video com­ VATEX offers 600 different human activities in its more thorough
prises 20 reference captions annotated by AMT workers. yet representative video material. Furthermore, the lexical rich­
xii) Charades: This dataset (Sigurdsson, et al., 2016) contains 9848 ness of the English and Chinese corpora in VATEX allows for more
videos of daily indoor household activities. These videos are natural and varied caption production.
recorded by 267 AMT workers from three different continents.
Videos are recorded in 15 different indoor scenes and restricted to 4.2. Performance evaluation parameters or metrics
the use of 46 objects and 157 action classes only. The dataset
comprises 66,500 annotations describing 157 actions. It also The performance of architecture greatly depends on the selection of
provides 41,104 labels to its 46 object classes. Moreover, it con­ the right evaluation metrics. In this section, the commonly used evalu­
tains 27,847 descriptions covering all the videos. The videos ation metrics are discussed in detail.
depict daily life activities with an average duration of 30 s. The i) BLEU metric: BLEU stands for Bilingual Evaluation Understudy. It
dataset is split into 7985 and 1863 videos for training and test is the most popular metric for evaluation that evaluates the performance
purposes respectively. of machine translation systems. This metric was proposed by IBM (K, p.,
xiii) VTW: Video Titles in the Wild (VTW) (Zeng et al., 2016) contains s, r., t, w., & w.j., z. , 2002) The central idea behind BLEU is “the better
18,100 video clips with an average of 1.5 min duration per clip. the BLEU score is if the closer is the machine translation to the profes­
Each clip is described in one sentence only. However, it in­ sional human translation, yields a higher translation quality”. It is used
corporates a diverse vocabulary, where on average one word for the comparison and counting of the number of co-occurrences and
appears in not more than two sentences across the whole dataset. depends on n − gram precision that computes per-corpus n − gram co-
The dataset is proposed for video title generation as opposed to occurrence, n ∈ [1, 4]. The BLEU score is calculated by:
video content description but can also be used for language-level
∑N
understanding tasks including video question answering. BLEU = BP.exp( wn logpn ); BLEU ∈ [0, 1] (5)
xiv) HowTo100M dataset: HowTo100M (Miech, et al., 2019) dataset n=1
is a large-scale dataset that contains 15 years of videos. This
dataset contains 136 M video clips with captions sourced from where N is usually set to 4 wn is weight and is set to 1/N and BP is a
1.2 M YouTube videos from domains such as hand-crafting, brevity penalty and pn is modified precision given by the equations (6),
personal care, fitness, gardening, art, entertainment, elec­ (7(a)), and (7(b)):
tronics, and cooking. Narrations for each video in the dataset are {
1; lc > ls
available from subtitles automatically downloaded from BP = (6)
e(1− lc /ls) ; lc ≤ ls
YouTube.
xv) VATEX: VATEX (Wang, et al., 2020) is a new, large-scale, and ­ where lc andls represents the length of the candidate’s sentence (ci ) and
multilingual video description dataset that includes 41,250 reference sentence (sij ).
videos and 825,000 Chinese and English captions. The following


(ii) METEOR Metric: METEOR, or Metric for Evaluation of Translation with Explicit ORdering, is an evaluation measure for machine translation output that correlates better with human judgment. METEOR (Banerjee & Lavie, 2005) was proposed in 2005 after the significance of the recall rate was recognized. It addresses some of the flaws inherited from BLEU and is based on a weighted combination of unigram precision and recall. The METEOR metric calculates precision, recall, and a weighted F-score (F_mean) over all cases of matching words, stems, and synonyms, based on the matched unigrams, together with a penalty function for incorrect word order.
Let m be the number of mapped unigrams between the two texts; then precision (P) and recall (R) can be given as m/c and m/r, where c and r are the candidate and reference lengths. Therefore,

F_mean = P·R / ( α·P + (1 − α)·R )  (8a)

P = |m| / Σ_k h_k(c_i)  (8b)

R = |m| / Σ_k h_k(s_ij)  (8c)

To account for word order in the candidate, a penalty function is given by the relation:

P_function = γ · (c/m)^∅  (9)

where c is the number of matching chunks, m is the total number of matches, and γ, α, and ∅ are the evaluation parameters, with commonly used default values of γ = 0.5, α = 0.9, and ∅ = 3. METEOR considers recall and precision over the entire corpus and also includes features that are not present in other metrics, such as synonym matching.

METEOR = (1 − P_function) · F_mean  (10)
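As a rough illustration of Equations (8a)-(10), the self-contained sketch below computes a METEOR-style score using exact unigram matches only; full METEOR additionally matches stems and WordNet synonyms and uses a specific alignment procedure, so this is a simplification with assumed parameter values (alpha = 0.9, gamma = 0.5, beta = 3).

```python
# Simplified METEOR sketch (Eq. 8a-10) on exact unigram matches only.
def meteor_sketch(candidate, reference, alpha=0.9, gamma=0.5, beta=3.0):
    cand, ref = candidate.split(), reference.split()
    used = [False] * len(ref)
    mapping = []                                   # (candidate position, reference position)
    for i, w in enumerate(cand):
        for j, r in enumerate(ref):
            if not used[j] and r == w:
                used[j] = True
                mapping.append((i, j))
                break
    m = len(mapping)
    if m == 0:
        return 0.0
    P, R = m / len(cand), m / len(ref)             # Eq. (8b), (8c)
    f_mean = P * R / (alpha * P + (1 - alpha) * R) # Eq. (8a)
    # Chunks: maximal runs of matches that are contiguous in both sentences.
    chunks = 1
    for (c1, r1), (c2, r2) in zip(mapping, mapping[1:]):
        if not (c2 == c1 + 1 and r2 == r1 + 1):
            chunks += 1
    penalty = gamma * (chunks / m) ** beta          # Eq. (9)
    return (1 - penalty) * f_mean                   # Eq. (10)

print(meteor_sketch("the cat sat on the mat", "on the mat sat the cat"))
```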
(iii) ROUGE Metric: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE is an automatic summarization evaluation method that came into existence in 2004 and was proposed by Chin-Yew Lin (Lin, 2004). It is an n-gram recall-based method that evaluates abstracts based on the co-occurrence information of n-grams: n-grams, word pairs, and word sequences are counted to assess the quality of abstracts. With the use of summaries generated by experts, one can measure the robustness and stability of the system. The family consists of ROUGE-N with N ∈ [1, 4], ROUGE-L, ROUGE-W, and the skip-bigram co-occurrence statistics ROUGE-S and ROUGE-SU, the latter being an extension of ROUGE-S.
a) ROUGE-N is mainly used for short-summary or single-document assessment and is based on n-gram co-occurrence statistics. For any n, the total number of n-grams is counted across all reference summaries and compared with the candidate summary. The method is concise and straightforward; however, for a large value of N the degree of discrimination becomes very low.
b) ROUGE-L is based on the LCS (longest common subsequence). Let A and B be two given sentences; if C is a subsequence of both A and B, then C is a common subsequence of A and B, and the LCS of A and B is the common subsequence of maximum length. Recall, precision, and F-score are evaluated in terms of the LCS as:

R = LCS(A, B) / m,  P = LCS(A, B) / n,  F = (1 + μ²)·R·P / (R + μ²·P)  (11)

where m is the length of the reference summary A and n is the length of the candidate summary B; F measures the similarity between A and B. This method requires only simple matching based on the occurrence of words.
c) ROUGE-W is a weighted LCS which introduces a weighting coefficient W that rewards the length of the longest continuously matching common subsequence. It is more discriminative than the plain LCS method, as sentences with more consecutive matches are weighted higher than those with fewer consecutive matches.
d) ROUGE-S is based on the concept of skip-bigrams, where a skip-bigram is any pair of words with arbitrary gaps in sentence order. Recall and precision are, in this case, evaluated as ratios over the total number of possible bigrams C(n, 2), where C is the combination function. The main disadvantage of this method is that, with an unlimited skip distance, many meaningless word pairs occur; this can be mitigated by setting a limit on the maximum skip distance.
e) ROUGE-SU combines skip-bigram and unigram-based co-occurrence statistics. If a sentence does not contain any bigram overlap, ROUGE-S assigns it no weight; ROUGE-SU, its extension, is therefore used in its place. The ROUGE metric is appropriate for single-document evaluation and performs well on short summaries, but has problems in the evaluation of multi-document text summaries.
Though ROUGE is a significant evaluation metric, it does not cater for words that have the same meaning, as it measures syntactical matches rather than semantics.
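A compact sketch of Equation (11) is shown below: a standard dynamic-programming LCS routine followed by the LCS-based recall, precision, and F-score. The example sentences and the choice μ = 1.2 are assumptions made only for illustration.

```python
# Sketch of ROUGE-L (Eq. 11): LCS via dynamic programming, then recall,
# precision, and the weighted F-score.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, mu=1.2):
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    r = lcs / len(ref)    # recall over the reference (length m)
    p = lcs / len(cand)   # precision over the candidate (length n)
    return (1 + mu ** 2) * r * p / (r + mu ** 2 * p)

print(rouge_l("a dog runs across the field",
              "the dog is running across a green field"))
```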


(iv) CIDEr Metric: Consensus-based Image Description Evaluation (Vedantam et al., 2015) is a metric that measures the similarity of generated captions against a set of sentences written by humans, and therefore correlates strongly with human evaluation. Evaluation is carried out in terms of saliency, grammar, and accuracy. The metric is based on a consensus protocol and measures the similarity between the sentence to be evaluated and a set of consensus image descriptions. CIDEr considers each sentence as a "document" made up of a set of n-grams and encodes the frequencies of candidate-sentence n-grams that are present in the reference sentences. TF-IDF (Term Frequency-Inverse Document Frequency) is used to calculate the weight of every n-gram, and the similarity between each reference caption and the caption generated by the model is then evaluated by calculating the average cosine similarity of the TF-IDF vectors. Mathematically, CIDEr is expressed as:

g_k(s_ij) = ( h_k(s_ij) / Σ_{w_l ∈ ξ} h_l(s_ij) ) · log( |I| / Σ_{I_p ∈ I} min(1, Σ_q h_k(s_pq)) )  (12)

CIDEr_n(c_i, s_ij) = (1/m) Σ_j ( g^n(c_i) · g^n(s_ij) ) / ( ‖g^n(c_i)‖ ‖g^n(s_ij)‖ )  (13)

where g_k(s_ij) represents the TF-IDF weight for each n-gram, h_k(s_ij) represents the number of occurrences of the n-gram ω_k in the reference sentence s_ij, and h_k(c_i) represents its number of occurrences in the candidate sentence c_i. S_i = {s_i1, s_i2, …, s_im} indicates the set of reference captions, ξ is the vocabulary of all n-grams, and |I| represents the number of images in the dataset. Equation (13) gives the CIDEr score for n-grams of length n. When multiple n-gram lengths are used to capture grammatical features and richer semantic information, the CIDEr score can be calculated as:

CIDEr(c_i, s_ij) = Σ_{n=1}^{N} ω_n CIDEr_n(c_i, s_ij)  (14)

In comparison to BLEU, CIDEr does not treat each matching word equally but applies a weighted treatment, which helps improve the accuracy of existing measures. The most popular version of CIDEr in image and video description evaluation is CIDEr-D, which incorporates a few modifications to the originally proposed CIDEr to prevent high scores for captions that fail badly in human judgment.
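The sketch below illustrates the mechanics of Equations (12)-(14) on toy data: TF-IDF-weighted n-gram vectors, average cosine similarity against the references of one image, and a uniform ω_n = 1/N over n = 1..4. The toy captions are invented, and the official CIDEr-D implementation adds refinements (e.g., a length-based penalty and count clipping) that are omitted here.

```python
# Simplified CIDEr sketch (Eq. 12-14) over toy reference sets.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cider_sketch(candidate, references, all_image_refs, N=4):
    score = 0.0
    for n in range(1, N + 1):
        # Document frequency of each n-gram over the reference sets of every image.
        df = Counter()
        for refs in all_image_refs:
            df.update({g for r in refs for g in ngrams(r.split(), n)})
        num_images = len(all_image_refs)

        def tfidf(tokens):
            counts = Counter(ngrams(tokens, n))
            total = sum(counts.values()) or 1
            return {g: (c / total) * math.log(num_images / max(df[g], 1))
                    for g, c in counts.items()}

        cand_vec = tfidf(candidate.split())
        sims = []
        for ref in references:
            ref_vec = tfidf(ref.split())
            dot = sum(w * ref_vec.get(g, 0.0) for g, w in cand_vec.items())
            norm = (math.sqrt(sum(v * v for v in cand_vec.values()))
                    * math.sqrt(sum(v * v for v in ref_vec.values()))) or 1.0
            sims.append(dot / norm)
        score += (1.0 / N) * (sum(sims) / len(sims))  # omega_n = 1/N, Eq. (14)
    return score

refs_img1 = ["a man rides a brown horse", "a person is riding a horse"]
refs_img2 = ["two dogs play in the park", "dogs are playing on the grass"]
print(cider_sketch("a man is riding a horse", refs_img1, [refs_img1, refs_img2]))
```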
(v) WMD Metric: The Word Mover's Distance (WMD) (Kusner et al., 2015) makes use of word embeddings, which are semantically meaningful vector representations of words learned from text corpora; dissimilarities between two text documents are measured by the WMD evaluation metric. Generated captions can have the same semantic meaning irrespective of their exact wording, and it is also possible for multiple captions to share the same attributes and objects yet still have different semantic meanings; WMD was proposed to address this problem. Thanks to distributed vector representations of words, word embeddings are good at capturing semantic meaning. Bag-of-words histograms, which include all but the start and stop words, are used to represent the generated description or caption, and the magnitude of each bag-of-words histogram is then normalized. To account for the semantic similarities that exist between pairs of words, the WMD metric uses the Euclidean distance in the word2vec embedding space. The distance between two documents or captions is then defined as the minimum cost required to move all words of one caption onto the other. The WMD metric is less sensitive to synonym swapping or word order than BLEU, ROUGE, and CIDEr, and it gives a high correlation with human judgments.
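A hedged usage sketch is given below, assuming a pre-trained word2vec embedding file is available locally (the file name is a placeholder) and that the optimal-transport backend required by recent gensim versions is installed; gensim's wmdistance returns a distance, so lower values indicate semantically closer captions.

```python
# WMD scoring sketch with gensim; the embedding path is a placeholder assumption.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

STOP = {"a", "an", "the", "is", "are", "on", "in", "of"}

def preprocess(caption):
    # Drop stop-style words, as described above, before computing WMD.
    return [w for w in caption.lower().split() if w not in STOP]

candidate = preprocess("a man is riding a horse on the beach")
reference = preprocess("a person rides a horse along the shore")

print("WMD:", wv.wmdistance(candidate, reference))
```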
(vi) SPICE Metric: Semantic Propositional Image Caption Evaluation (SPICE) (Anderson, Fernando, Johnson, & Gould, 2016) was introduced in 2016 and is a more recent evaluation metric for image and video descriptions. It measures the similarity between the scene-graph tuples generated from the machine descriptions and those of the ground truth. The semantic scene graph encodes objects, their attributes, and their relationships, obtained through a dependency parse tree. A scene-graph tuple of a caption k contains the attributes A(k), relations R(k), and object classes O(k), and is represented as:

G(k) = < A(k), R(k), O(k) >  (15)

Like METEOR, SPICE also uses WordNet to find synonyms and treat them as positive matches. SPICE can also tell us, for example, 'which caption generator best understands colors?' and 'can caption generators count?'. Moreover, SPICE captures human judgments better than other evaluation metrics. A typical failure case stems from parsing errors: for instance, in the sentence "black cat swimming through river", the word "swimming" may be parsed as an object and the word "cat" as an attribute, resulting in a very poor score.
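SPICE itself relies on a dependency parser and WordNet to build and match the scene-graph tuples of Equation (15); the small sketch below assumes the tuples have already been extracted (they are hand-written here) and only shows the final F-score over matched tuples.

```python
# SPICE-style final step: F1 over already-extracted scene-graph tuples.
def tuple_f1(candidate_tuples, reference_tuples):
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    p, r = matched / len(cand), matched / len(ref)
    return 2 * p * r / (p + r)

# G(k) = <A(k), R(k), O(k)> flattened into one tuple set; tuples are illustrative.
reference = {("cat",), ("cat", "black"), ("river",), ("cat", "swim-in", "river")}
candidate = {("cat",), ("cat", "black"), ("water",), ("cat", "swim-in", "water")}
print(f"SPICE-style tuple F1: {tuple_f1(candidate, reference):.2f}")
```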
(vii) UMIC Metric: UMIC (Lee et al., 2021), an Unreferenced Metric for Image Captioning, does not require reference captions to evaluate captions generated for images and is built on UNITER (Chen, et al., 2020). Negative captions are constructed from the reference captions with the help of pre-defined rules, and UNITER is then fine-tuned to distinguish the reference captions from the synthetic negative captions, yielding UMIC. Although there is still room for improvement, as UMIC outputs very low scores in some cases, it outperforms the previous evaluation metrics in terms of effectiveness and generalization ability.

4.2.1. Discussion
Performance evaluation metrics assess the suitability of a caption to the visual input by comparing how well the candidate caption matches the reference captions. Most of the existing visual captioning metrics, such as BLEU, METEOR, ROUGE, and CIDEr, are rooted in n-gram matching. However, n-gram matching is inadequate for caption evaluation. When some words are replaced with their synonyms, the metric scores drop, with BLEU and CIDEr being the most affected; this is due to the failure to match synonyms. Also, reducing the length of sentences has less effect on CIDEr than on BLEU, METEOR, and ROUGE. Though CIDEr has a better association with human judgments, it is difficult to optimize. To overcome these issues, (Anderson, Fernando, Johnson, & Gould, 2016) introduced the SPICE metric based on semantic concepts. This metric correlates better with human quality assessment as it is robust to word order changes. Furthermore, SPICE can also answer questions such as 'which caption generator best understands colors?' and 'can a caption generator count?'.
In (Liu et al., 2018), SPIDEr, a combination of SPICE and CIDEr, is presented; it uses the policy gradient method to optimize this combined metric, which is considered a good choice by human raters. WMD exploits word embeddings, which are semantically meaningful vector representations of words learned from text corpora; it is insensitive to word order and synonym swapping and gives a high correlation with human judgments when compared with other evaluation metrics. Recently, the UMIC metric has proven to be more effective and generic, as it does not need reference captions for evaluation and outperforms the previous evaluation metrics, demonstrating effectiveness and generalization ability. Also, to measure the stylishness of the generated sentence, the StyleCIDEr and OneStyle metrics have been proposed (Li & Harrison, 2022), with more emphasis placed on stylized words.

4.3. Comparison of Popular Datasets and Evaluation Metrics

This sub-section presents an overview of the deep-learning approaches discussed in this survey, in Tables 17 and 18, together with the different datasets and evaluation metrics employed by each of these approaches for image and video captioning. From Table 17, it is evident that more recent approaches use benchmark datasets (MSCOCO, Flickr8K, and Flickr30K) and employ evaluation metrics that correlate well with human judgments. The datasets mainly employed for dense image captioning and image paragraph generation are the recently developed Visual Genome and Stanford Paragraph datasets. Similarly, for change image captioning tasks, Spot-the-Diff and CLEVR (Johnson et al., 2016) are widely employed. Further, we note that very few works have been reported on Abstract Scenes, the Instagram dataset, ReferIt, Egoshots, FlickrNYC, etc., or on the stylized image captioning datasets SentiCap and FlickrStyle8K/10K, which contain romantic, humorous, and factual captions. Similarly, an overview of the datasets employed by the video captioning approaches discussed in this survey is presented in Table 18. Early datasets such as M-VAD, MPII-MD, and YouCookII contain videos of specific categories like cooking, makeup, and movies. The open-domain video captioning datasets MSVD, MSR-VTT, HowTo100M, VATEX, and many others later made a breakthrough in this field. Further, a significant amount of work has been discussed on the ActivityNet Captions dataset, which provides dense descriptions of videos.
As shown in Table 19, the performance of recent deep-learning techniques on the Visual Genome dataset is provided, with the results in bold representing the best performance for a given method. In Table 20a, comparison results on the ImageNet dataset for novel-object-based image captioning are shown; the F-1 score is reported as an additional measure for evaluating performance on ImageNet. (Li et al., 2019) provide the best F-1, CIDEr, SPICE, and METEOR scores of 60.9, 88.3, 16.6, and 23.4, respectively. Further, Table 20b provides comparison results for recent novel-object image captioning models on the nocaps dataset. Furthermore, Table 22a and Table 22b compare different state-of-the-art techniques on the benchmark MSCOCO dataset in terms of BLEU-n, METEOR, CIDEr, and ROUGE, reported for the single-sentence IC and dense IC tasks. Table 23 provides state-of-the-art results on the Flickr8K/30K datasets. Wu et al. (2018, March) obtained the best results by generating the descriptions using semantic concepts


Table 19
Performance of Visual Genome dataset for Dense or Paragraph Image Captioning Task.
Dataset Ref Method B-1 B-2 B-3 B-4 METEOR CIDEr

Visual (Wang, Luo, Li, Huang, & Yin, 2018) Encoder-Decoder – – 11.7 6.6 13.9 17.3
Genome (Kyriazi, Han, & Rush, 2018) Based IC 43.54 27.44 17.33 10.58 30.63 17.86
(Johnson, Karpathy, & Fei-Fei., Densecap: Fully convolutional localization – – – – 30.5 –
networks for dense captioning, 2016)
(Kim, Oh, Choi, & Kweon, 2020) – – – – 18.73 –

Table 20a
Performance of deep-learning techniques on ImageNet dataset for Novel-object Based Image Captioning Task.
Dataset Ref F-1 CIDEr SPICE BLEU METEOR

ImageNet (Hendricks, et al., 2016) 33.60 – – 64 20.71


(Yao, Pan, Li, & Mei, Incorporating copying mechanism in image captioning for learning novel objects, 2017) 55.66 – – – 23
(Venugopalan, et al., 2016) 45.80 – – – 20.04
(Li, Yao, Pan, Chao, & Mei, 2019) 60.9 88.3 16.6 – 23.4
(Feng, et al., 2019) 64.08 – – – 21.31

Table 20b
Performance of deep-learning techniques on the nocaps dataset for the Novel-object Based Image Captioning Task.
Dataset Ref B-1 B-4 CIDEr SPICE ROUGE-L METEOR

NOCAPS (Agrawal, et al., 2019) 73.4 12.9 61.5 9.7 48.7 22.1
(Hu, et al., 2021) – – 86.6 12.4 – –
(Cao, et al., 2020) – – 109.7 20.2 – 27.2
(Wu, Jiang, & Yang, Switchable Novel Object Captioner, 2023) – – – – – 21.88

Table 21
Performance of deep-learning techniques on Stanford Paragraph dataset for Paragraph Image Captioning Task.
Dataset Ref B-1 B-2 B-3 B-4 METEOR CIDEr

Stanford Paragraph Dataset (Li, Liang, Shi, Feng, & Wang, 2020) 41.6 24.4 14.3 8.6 15.6 17.4
(Zha, Liu, Zhang, Zhang, & Wu, 2022) 42.01 25.86 15.33 9.26 16.83 21.12
(Guo, Lu, Chen, & Zeng, 2021) 40.93 25.51 15.94 9.96 16.88 26.15
(Shi, et al., 2021) 44.32 25.86 14.80 8.33 16.89 21.41

for both datasets. Similarly, from Table 23 we conclude that (Zhao, Chang, & Guo, 2019) provides the highest results for the BLEU-n and METEOR metrics, while (Zhang, et al., 2020) provides the highest results for CIDEr and ROUGE on MSCOCO, and (Chen et al., 2017) provides the best results for Flickr8K/30K. Furthermore, Tables 21 and 24 compare the state-of-the-art techniques for the Stanford Paragraph dataset and the Spot-the-Diff dataset with the commonly used evaluation metrics. The performance of different state-of-the-art methods (Guo, Liu, Yao, Li, & Lu, 2019) (Chen et al., 2018d) for stylized image captioning, evaluated on the MSCOCO, SentiCap, and FlickrStyle10K datasets, is depicted in Table 25a and Table 25b with the commonly used evaluation metrics: Table 25a reports the performance of (Zhao et al., 2020) (Gan et al., 2017) (Guo, Liu, Yao, Li, & Lu, 2019) (Chen et al., 2018d) evaluated for positive and negative styles collected from the SentiCap dataset, whereas in Table 25(b) the

Table 22a
Performance of deep-learning techniques on benchmark MSCOCO dataset for Single Sentence IC Task.
Dataset Ref Captioning Method B-1 B-2 B-3 B-4 M C R S

(Wu, Shen, Wang, Dick, & Hengel, 2018) Encoder-Decoder Based IC 74.0 56.0 42.0 31.0 26.0 – – –
MSCOCO (Pu, et al., 2016) 72.0 52.0 37.0 28.0 24.0 90.0 – –
(Liu, MAo, Sha, & Yuille, 2017) – 37.2 27.6 24.78 – – –
(Gao, Wang, & Wang, 2018) 67.2 49.2 35.5 26.1 22.3 76.0 – –
(Herdade, Kappeler, Boakye, & Soares, 2020) 80.5 – – 38.6 28.7 128.3 58.4 22.6
(Xiao, Xue, Shen, & Gao, 2022) 78.8 63.8 50.4 39.5 28.8 121.1 58.4 21.8
(Liu, et al., 2022) 81.7 – – 40.2 29.5 133.1 59.4 23.2
(Zeng, Zhang, Song, & Gao, 2022) 81.1 – – 39.6 29.6 133.5 59.1 23.2
(Jia, Wang, Peng, & Chen, 2022) – – – 39.1 28.9 129.6 58.7 22.6
(Wang, Yang, Bartz, & Meinel, 2016) 67.2 49.2 35.2 24.4 – – – –
(Anderson, et al., 2018) 79.8 36.3 27.7 120.1 56.9 21.4
(Zhang, et al., 2020) 73.6 56.9 43.0 32.5 26.0 101.9 54.3 –
(Chen L., et al., 2017) 71.9 54.8 41.1 31.1 25.0 – – –
(Zhao, Chang, & Guo, 2019) Multimodal Learning Based IC 73.8 57.0 43.2 32.7 26.1 101.8 54.1 –
(Mao, et al., 2015) 67.0 49.0 35.0 25.0 – – – –
(Cheng, et al., 2017) 71.0 51.3 37.2 27.1 23.3 – – –
(Nikolaus, Abdou, Lamm, Aralikatte, & Elliott, Compositional Architecture Based 36.6 – – – 27.4 105.3 – 20.9
2019) IC
(Tian & Oh, 2020) 77.2 – – 33.0 – 108.9 59.4 20.4


Table 22b
Performance of deep-learning techniques on benchmark MSCOCO dataset for Dense IC Task.
Dataset Ref B-1 B-2 B-3 B-4 M C R S

MSCOCO (Cornia, Stefanini, Baraldi, & Cucchiara, 2020) 80.8 – – 39.1 29.2 131.2 58.6 22.6
(Zha, Liu, Zhang, Zhang, & Wu, 2022) 80.1 64.7 50.0 37.9 28.1 121.6 58.2 –
(Krause, Johnson, Krishna, & Li, 2016) 41.9 24.1 14.2 8.69 15.95 13.52 – –

Table 23
Performance of Single Sentence IC deep-learning techniques on benchmark Flickr8K and Flickr30K datasets.
Dataset Ref Captioning Method B-1 B-2 B-3 B-4 METEOR

Flickr8K (Xu, et al., 2015) Encoder-Decoder Based IC 67.0 45.7 31.4 25.7 20.3
(Wu, Shen, Wang, Dick, & Hengel, 2018) 74.0 54.0 38.0 27.0 –
(Pu, et al., 2016) 72.0 52.0 36.0 25.0 –
(Sharma, Dhiman, & Kumar, 2022) 70.2 49.1 35.9 26.6 –
(Jia, Gavves, Fernando, & Tuytelaars, 2015) 64.7 45.9 31.8 21.6 20.1
(Wang, Yang, Bartz, & Meinel, 2016) 65.5 46.8 32.0 21.5 –
(Zhang, et al., 2020) 59.8 40.8 27.5 18.4 –
(Chen L., et al., 2017) 68.2 49.6 35.9 25.8 22.4
(Karpathy & Fei-Fei, Deep Visual-Semantic Alignments for Generating Image 57.9 38.3 24.5 16.0 –
Descriptions, 2015) Multimodal Learning Based
(Chen & Zitnick, 2015) IC – – – 13.1 16.9
(Zhao, Chang, & Guo, 2019) 64.5 46.2 32.7 22.7 20.6
(Mao, et al., 2015) 56.5 38.6 25.6 17.0 –
(Xu, et al., 2015) Encoder-Decoder Based IC 69.9 43.9 29.6 19.9 18.4
(Wu, Shen, Wang, Dick, & Hengel, 2018) 73.0 55.0 35.7 40.0 28.0
(Pu, et al., 2016) 72.0 53.0 38.0 25.0 –
(Jia, Gavves, Fernando, & Tuytelaars, 2015) 64.6 46.6 30.5 20.6 17.9
(Wang, Yang, Bartz, & Meinel, 2016) 62.1 42.6 28.1 19.3 –
(Zhang, et al., 2020) 59.2 39.1 25.7 17.0 –
Flickr30K (Xiao, Xue, Shen, & Gao, 2022) 68.3 49.9 35.5 25.3 20.8
(Chen L., et al., 2017) 66.2 46.8 32.5 22.3 19.5
(Karpathy & Fei-Fei, Deep Visual-Semantic Alignments for Generating Image 57.3 36.9 24.0 15.7 –
Descriptions, 2015) Multimodal Learning Based
(Chen & Zitnick, 2015) IC – – – 12.0 15.2
(Zhao, Chang, & Guo, 2019) 66.1 47.2 33.4 23.2 19.4
(Mao, et al., 2015) 60.0 41.0 28.0 19.0 –

Table 24
Performance of Change Image Captioning Task on Spot-the-diff Dataset.
Dataset Ref B-4 SPICE METEOR CIDEr

Spot-the-Diff (Park, Darrell, & Rohrbach, 2019) 40.3 16.1 27.1 56.7
(Kim, Kim, Lee, Park, & Kim, 2021) 44.5 17.1 29.2 70.0
(Shi, Yang, Gu, Joty, & Cai, 2020) 40.9 15.8 27.1 60.1

Table 25a
Performance of Stylized Image Captioning Task for Positive and Negative Styles.
Dataset Ref For Pos For Neg
Testing Training B-1 B-3 M C B-1 B-3 M C

SentiCap MSCOCO/SentiCap (Zhao, Wu, & Zhang, 2020) 50.8 17.1 16.6 54.4 48.7 19.6 15.8 60.6
SentiCap SentiCap (Mathews, Xie, & He, 2016) 49.1 17.5 16.8 54.4 50.0 20.3 16.8 61.8
SentiCap MSCOCO/SentiCap (Guo, Liu, Yao, Li, & Lu, 2019) 46.9 16.2 16.8 55.3 45.5 15.4 16.2 51.6
SentiCap MSCOCO/SentiCap (Chen T., et al., 2018) 50.5 19.1 16.6 60.0 50.3 20.1 16.2 59.7
SentiCap MSCOCO/SentiCap (Wu, Zhao, & Luo, 2022) 52.3 18.2 17.0 54.8 49.3 18.4 16.3 55.0

performance of (Zhao et al., 2020) (Gan et al., 2017) (Guo, Liu, Yao, Li, & Lu, 2019) (Chen et al., 2018d) evaluated for humorous and romantic styles collected from the FlickrStyle10K dataset is reported.
Performance analysis of various video captioning techniques on different datasets using different evaluation metrics is reported in Tables 26-31. From this, we conclude that most methods have mainly used the MSVD dataset, followed by MSR-VTT, M-VAD, MPII-MD, ActivityNet, and YouCookII, for experiments. Though M-VAD and MPII-MD are very challenging datasets, they still show very low benchmark results, while results on MSR-VTT are overall better than on M-VAD and MPII-MD. Table 26 reports mostly METEOR scores on three benchmark datasets for different techniques. A performance-based analysis of different deep-learning techniques on the MSR-VTT dataset for the BLEU-4, METEOR, CIDEr, and ROUGE metrics is shown in Table 27. (Phan et al., 2017) provide the best BLEU-4, METEOR, and ROUGE scores of 44.1, 29.1, and 62.4 on the MSR-VTT dataset, respectively. Results on other popular and recent datasets, ActivityNet Captions and Charades, are presented in Tables 28 and 29. The ActivityNet Captions dataset is mainly used for dense video captioning and has been gaining popularity in recent

52
D. Sharma et al. Expert Systems With Applications 221 (2023) 119773

Table 25b
Performance of Stylized Image Captioning Task for romantic and humorous styles.
Dataset Ref For Romantic For Humorous
Testing Training B-1 B-3 M C B-1 B-3 M C

FlickrStyle10K MSCOCO/FlickrStyle10K (Zhao, Wu, & Zhang, 2020) 21.2 4.8 8.4 22.4 19.9 4.3 7.4 19.4
FlickrStyle10K FlickrStyle10K (Gan, Gan, He, & Gao, 2017) 46.1 15.2 15.4 31.0 48.7 14.6 15.2 27.0
FlickrStyle10K MSCOCO/FlickrStyle10K (Guo, Liu, Yao, Li, & Lu, 2019) 17.0 2.0 5.4 10.1 16.3 1.9 5.3 15.2

FlickrStyle10K MSCOCO/FlickrStyle10K (Chen T., et al., 2018) 27.8 8.2 11.2 37.5 27.4 8.5 11.0 39.5
FlickrStyle10K MSCOCO/FlickrStyle10K (Wu, Zhao, & Luo, 2022) 25.4 5.7 9.2 24.7 27.2 5.9 9.0 22.4

Table 26
Performance of deep-learning SSVC techniques on benchmark datasets.
Dataset Ref Captioning Method B-4 METEOR CIDEr ROUGE

MSVD (Venugopalan, Anne, Mooney, & Saenko, 2016) Seq-2-Seq With & Without 42.1 31.4 – –
(Yang, et al., 2018) Attention 42.9 30.4 –
(Chen J., et al., Retrieval Augmented Convolutional Encoder-Decoder Networks for 53.5 34.6 82.4 –
Video Captioning, 2022)
(Pan, Yao, Li, & Mei, 2016) 52.8 33.5 74
(Bin, et al., 2018) – 29.8 – –
(Long, Gan, & Melo, 2016) 52.0 33.5 72.1 –
(Gao, Guo, Zhang, Xu, & Shen, 2017) 50.8 33.3 74.8 –
(Chen J., et al., Temporal Deformable Convolutional Encoder-Decoder Networks for 49.8 32.7 67.2 –
Video Captioning, 2019)
(Jin & Liang, 2016) – 26.17 – –
(Pan, Xu, Yang, Wu, & Zhuang, 2016) Hierarchical-Based VC 43.8 33.1 – –
(Liu A.-A., et al., 2017) 44.3 32.4 68.4 68.9
(Zhang & Peng, Video Captioning With Object-Aware Spatio-Temporal Correlation and 57.5 36.8 92.1 –
Aggregation, 2020)
(Wu & Han, 2020) 49.28 34.67 81.49 –
(Wei, Mi, Hu Zhen, & Chen, 2020) Deep-Reinforcement Based VC 46.8 34.4 85.7 –
(Chen, Wang, Zhang, & Huang, 2018) 46.1 33.1 69.1 69.2
(Pasunuru & Bansal, 2017) 54.4 34.9 88.6 72.2
(Gao, et al., 2022) Transformer Based VC 56.9 38.4 99.2 75.1
(Lin, et al., 2022) 58.2 41.3 120.6 77.5
M-VAD (Pan, Mei, Yao, Li, & Rui, 2016) Seq-2-Seq With & Without – 6.7 – –
(Venugopalan, et al., 2015) Attention – 6.7 – –
(Venugopalan, Anne, Mooney, & Saenko, 2016) – 6.7 – –
(Yang, et al., 2018) – 6.3 – –
(Pan, Yao, Li, & Mei, 2016) – 6.4 – –
MPII- (Pan, Mei, Yao, Li, & Rui, 2016) Seq-2-Seq With & Without – 7.3 – –
MD (Venugopalan, et al., 2015) Attention – 7.1 – –
(Venugopalan, Anne, Mooney, & Saenko, 2016) – 6.8 – –
(Yang, et al., 2018) – 7.2 – –
(Pan, Yao, Li, & Mei, 2016) – 7.4 – –

Table 27
Performance of some deep-learning SSVC techniques on commonly used MSR-VTT.
Dataset Ref Captioning Task B-4 METEOR CIDEr CIDEr- CIDEnT ROUGE
D

MSR- (Bin, et al., 2018) Seq-2-Seq With & Without – 26.1 – – – –


VTT (Long, Gan, & Melo, 2016) Attention 39.1 26.7 – – – –
(Gao, Guo, Zhang, Xu, & Shen, 2017) 38.8 26.1 43.2 – – –
(Chen J., et al., Temporal Deformable Convolutional Encoder- 39.5 27.5 42.8 – – –
Decoder Networks for Video Captioning, 2019)
(Chen J., et al., Retrieval Augmented Convolutional Encoder- 40.4 28.1 47.9 – – –
Decoder Networks for Video Captioning, 2022)
(Yang, et al., 2018) 36.0 26.1 – – – –
(Wu & Han, 2020) Hierarchical-Based VC 37.5 26.9 41.7 – – –
(Zhang & Peng, Video Captioning With Object-Aware Spatio- 41.9 28.6 48.2 – – –
Temporal Correlation and Aggregation, 2020)
(Jin, Huang, Chen, Li, & Zhang, 2020) Transformer Based VC 42.1 28.9 51.6 – – 61.5
(Gao, et al., 2022) 44.5 30.0 56.3 63.3
(Lin, et al., 2022) 41.9 29.9 53.8 62.1
(Chen, Wang, Zhang, & Huang, 2018) Deep-Reinforcement Based 38.9 27.2 42.1 – – 59.5
(Wei, Mi, Hu Zhen, & Chen, 2020) VC 38.5 26.9 – 43.7 – –
(Wang, Chen, Wu, Wang, & Wang, 2018) 41.3 28.7 – 48.0 – 61.7
(Pasunuru & Bansal, 2017) 40.5 28.4 – 51.7 44 61.4
(Phan, Henter, Miyao, & Satoh, 2017) 44.1 29.1 – 49.7 – 62.4


Table 28
Performance of Video captioning methods on ActivityNet Captions.
Dataset Ref Captioning B-1 B-2 B-3 B-4 METEOR CIDEr ROUGE
Method

ActivityNet (Zhou, Zhou, Corso, Socher, & Xiong, 2018) Dense VC – – – – 10.02 – –
(Xiong, Dai, & Lin, 2018) 39.11 22.26 13.52 8.45 14.75 14.15 25.88
(Zhang, Xu, Ouyang, & Tan, 2020) 22.76 10.12 4.26 1.64 10.71 31.41 22.85
(Suin & Rajagopalan, 2020) – – 2.87 1.35 6.21 13.82 –
(Iashin & Rahtu, Multi-modal Dense Video Captioning, 2020) – – 2.31 0.92 6.80 – –
(Yamazaki, et al., 2022) 13.38 17.48 30.29 35.99
(Iashin & Rahatu, A Better Use of Audio-Visual Cues: Dense Video – – 3.47 1.65 8.05 – –
Captioning with Bi-modal Transformer, 2020)
(Song, Chen, & Jin, 2021) – – – 12.20 16.10 27.36 –
(Sur, 2020) Transformer- – – 4.92 2.88 10.16 – –
(Sun, Myers, Vondrick, Murphy, & Schmid, 2019) Based VC – – 7.59 4.33 11.94 0.55 28.8

Table 29
Performance of Video captioning methods on Charades.
Dataset Ref Captioning Method B-1 B-2 B-3 B-4 METEOR CIDEr ROUGE

Charades (Zhao, Li, & Lu, 2018) Seq-2-Seq with Attention 50.7 31.3 19.7 13.3 19.0 18.0 –
(Wei, Mi, Hu Zhen, & Chen, 2020) Deep-Reinforcement Based VC – – – 12.7 17.2 21.6 –
(Wang, Chen, Wu, Wang, & Wang, 2018) 64.4 44.3 29.4 18.8 19.5 23.2 41.4
(Song, Chen, & Jin, 2021) Dense VC – – – 20.34 20.05 27.54 –

Table 30
Performance of Video captioning methods on YouCookII.
Dataset Ref Captioning Method B-3 B-4 METEOR CIDEr ROUGE

YouCook II (Luo, et al., 2020) Transformer-Based VC 23.87 17.35 22.35 1.81 46.52
(Sun, Baradel, Murphy, & Schmid, 2019) – 5.12 12.97 0.64 30.44
(Lin, et al., 2022) 13.8 9.0 15.6 109.0 37.3
(Zhu & Yang, 2020) 8.66 5.41 13.30 0.65 30.56
(Sur, 2020) – 4.20 6.95 – –
(Zhou, Zhou, Corso, Socher, & Xiong, 2018) Dense VC – 30.0 6.58 – –
(Yamazaki, et al., 2022) 9.56 17.95 49.41 35.17

Table 31
Performance of Video captioning methods on VATEX.
Dataset Ref Captioning Method B-4 METEOR CIDEr ROUGE

VATEX (Shi, et al., 2022) Transformer-Based VC 42.25 29.08 50.79 61.87


(Gao, et al., 2022) 44.5 30.0 56.3 63.3
(Lin, et al., 2022) 38.7 26.2 73.0 53.2
(Zhang, et al., 2020) 32.1 22.2 49.7 48.9
(Chen J. , et al., Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Seq-2-Seq With 32.1 21.8 48.7 –
Captioning, 2022) Attention

years. (Xiong et al., 2018) provide the best BLEU scores, while (Song et al., 2021) provide the best METEOR score. For the Charades dataset, (Wang, Chen, Wu, Wang, & Wang, 2018) provides the best BLEU-1, 2, and 3 scores, and (Song et al., 2021) provides the best BLEU-4, METEOR, and CIDEr scores. Also, Table 30 and Table 31 show results on the YouCookII and VATEX datasets. (Luo, et al., 2020) provided the best scores on BLEU-3, BLEU-4, METEOR, CIDEr, and ROUGE for YouCookII, whereas (Gao, et al., 2022) provided the best scores for BLEU-4, ROUGE, and METEOR and (Lin, et al., 2022) provided the best CIDEr score on the VATEX dataset.

5. Conclusion and Future Work

In this survey, we have discussed the basic structure of visual captioning for the generation of natural language descriptions for a given image/video and reviewed various image and video captioning techniques, with more focus on recent deep-learning techniques, which are still evolving. It shows that the performance of deep-learning-based image and video captioning techniques has improved. The paper also touched upon Change Image Captioning (CIC), which is a very challenging task and a stepping stone in the field of visual captioning. The CIC method utilizes the time dimension along with the spatial dimension of an image, and it has huge potential applications in aerial imagery, analysis for disaster response systems, monitoring of land cover dynamics, and street scene surveillance. We have also reviewed various popular benchmark datasets, commonly used for training and testing visual captioning models, and the evaluation metrics used to assess the quality of the generated captions or descriptions. Although deep-learning-based techniques have achieved remarkable progress in recent years, a robust and efficient captioning model which can generate high-quality captions for images and videos is yet to be achieved in terms of visual recognition and computational linguistics. Despite the recent success of deep-learning-based techniques, there exist a few challenges for both image and video captioning:


i) Deep-learning-based visual captioning lacks task-specific evaluation metrics, i.e., for stylized captioning (Gan et al., 2017), change image captioning (Qiu et al., 2020, August), novel object captioning (Agrawal, et al., 2019), and dense captioning (Bao, Zheng, & Mu, 2021), (Johnson et al., 2016). Most of the popularly known evaluation metrics, discussed in Section 4, do not give preference to word ordering and, therefore, captions are generated with abnormalities, without considering negations and punctuation. In this direction, (Li & Harrison, 2022) is a promising attempt to incorporate styles for style-based image captioning evaluation. Similar task-specific evaluation metrics can make evaluation more effective by forcing the model to generate task-specific captions.
ii) The size of the dataset (e.g., FlickrStyle10K, SentiCap, MPII, and YouCook) affects the training and hence the performance of the computational model. Small datasets may lead to overfitting, resulting in variations between the ground truth and the generated captions. Further, small datasets affect the learning of syntactic correctness, resulting in missing information in the generated captions.
iii) With the increase in the complexity of visual tasks under increasingly realistic conditions (Cao, et al., 2020) (Shao, Han, Marnerides, & Debattista, 2022), (Seo, Nagrani, Arnab, & Schmid, 2022), the most intimidating challenge that researchers face is how to develop algorithms that can deal with the combinatorial explosion. To overcome this, researchers will need to rethink how to train and evaluate vision algorithms.
iv) The current methods discussed in the survey require stronger language modeling structures for capturing long-range dependencies (Escorcia, Heilbron, Niebles, & Ghanem, 2016). This will help in the generation of semantically and syntactically correct captions.
v) To increase the performance of captioning models, common-sense knowledge can be incorporated. This helps the model connect semantic concepts with visual cues, or with human gaze behaviors, to imitate human visual focus. In this direction, (Fang, Gokhale, Banerjee, Baral, & Yang, 2020) proposed a cross-modal transformer model for the generation of common-sense-enriched descriptions of videos.
vi) Most of the methods discussed in this paper are sensitive to changes in visual data, which causes the misidentification of objects. Therefore, CIC-based models (Park et al., 2019), which are less sensitive to changes in the visual data and capable of distinguishing relevant scene changes from illumination or viewpoint changes, should be encouraged.
vii) Various deep-learning models only deal with the modeling of salient objects and their trajectories (Zhang & Peng, 2020) and ignore the understanding of the interaction relationships among objects, which is very challenging. To overcome this, more effective graphs should be constructed that model the relations among different object instances and explore their interactions across the temporal sequence in an end-to-end model.

Based on the challenges discussed above, various promising research directions are yet to be explored fully in the future, such as:

i) The generation of captions with styles helps reflect various human emotions (romance, humor, etc.) in the captions. Stylized captioning has not been much explored due to the limited available data; therefore, it is still an open issue.
ii) Dense visual captioning is another promising direction, as it generates more elaborate descriptions of a given image or video. The attention mechanism has shown a prominent impact on the generation of descriptions of visual content. In the recent past, the models developed have shown encouraging results, but there is always scope for improvements over the existing state-of-the-art captioning for images and videos.
iii) Real-time caption generation for images and videos remains the most intimidating challenge to deal with. Therefore, unsupervised learning and reinforcement-based learning may prove to be a more realistic way of generating captions in real time.
iv) Caption generation for 3-D images and videos is yet to be explored and is a promising research direction for the future.
v) Furthermore, we have observed that all the metrics being used are either designed for image captioning or adopted from machine translation, and no metrics are designed dedicatedly for video captioning. This is one of the reasons for the poor performance and efficiency reported for dense video captioning and story-telling tasks.
vi) Existing visual captioning techniques focus on the visual description problem. It would be more interesting to think one step forward and develop visual understanding systems such as Visual Question Answering (VQA) and Visual Reasoning, which have the potential to perform much better in the future.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

Aafaq, N., Mian, A., Liu, W., Gilani, S. Z., & Shah, M. (2019, October). Video Description: A Survey of Methods, Datasets, and Evaluation Metrics. ACM Computing Surveys, 52(6), 1-37.
Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., . . . Anderson, P. (2019). nocaps: novel object captioning at scale. IEEE/CVF International Conference on Computer Vision (ICCV), (pp. 8948-8957). Seoul, Korea.
Alayrac, J., Bojanowski, P., Agrawal, N., Sivic, J., & Lacoste-Julien, S. (2016). Unsupervised learning from narrated instruction videos. In Computer Vision and Pattern Recognition (CVPR) (pp. 4575-4583). Caesars Palace: IEEE.
Alcantarilla, P. F., Stent, S., Ros, G., Arroyo, R., & Gherardi, R. (2018). Street-view change detection with deconvolutional networks. Autonomous Robots, 42(7), 1301-1322.
Amirian, S., Rsheed, K., Taha, T. R., & Arabnia, H. R. (2020, December). Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap. IEEE Access, 8, 218386-218400.
Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic Propositional Image Caption Evaluation. arXiv:1607.08822v1.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. arXiv:1707.07998v3.
Babru, A., Bridge, A., Burchill, Z., Coroian, D., Dickinson, S., Fidler, S., . . . Salvi, D. (2012). Video in sentences out. arXiv:1204.2742.
Bach, F. R., & Jordan, M. I. (2002, July). Kernel independent component analysis. Journal of Machine Learning, 3, 1-48.
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
Bai, S., & An, S. (2018, October). A survey on automatic image caption generation. Neurocomputing, 311, 291-304.
Banerjee, S., & Lavie, A. (2005, June). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, (pp. 65-72). Ann Arbor, Michigan.
Bao, P., Zheng, Q., & Mu, Y. (2021). Dense Events Grounding in Video. Virtual Mode: Association for the Advancement of Artificial Intelligence.
Bergman, L., & Hoshen, Y. (2020). Classification-based Anomaly detection for general data. arXiv:2005.02359.
Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., . . . Plank, B. (2016, April). Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures. Journal of Artificial Intelligence Research, 55, 409-442.
Bin, Y., Yang, Y., Shen, F., Xie, N., Shen, H. T., & Li, X. (2018, May). Describing Video With Attention-Based Bidirectional LSTM. IEEE Transactions on Cybernetics, 49(7), 2631-2641.


Brand, M. (1997). The” Inverse hollywood problem”: from video to scripts and Donahue, J., Hendricks, L., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K.,
storyboards via causal analysis. AAAI’97/IAAI’97: Proceedings of the fourteenth & Darrell, T. (2015). Long-Term Recurrent Convolutional Networks for Visual
national conference on artificial intelligence and ninth conference on Innovative Recognition and Description. IEEE Transactions on Pattern Analysis and Machine
applications of artificial intelligence, (pp. 132-137). Providence, Rhode Island. Intelligence, 677–691.
Bugliarello, E., & Elliott, D. (2021). The Role of Syntactic Planning in Compositional Dunning, T. (1993, March). Accurate methods for the statistics of surprise and
Image Captioning. arXiv:2101.11911v1. coincidence. Computational Linguistics, 19(1), 61-74.
Cao, T., Han, K., Wang, X., Ma, L., Fu, Y., Jiang, Y.-G., & Xue, X. (2020). Feature Escorcia, V., Heilbron, F. C., Niebles, J. C., & Ghanem, B. (2016). September) (pp.
Deformation Meta-Networks in Image Captioning of Novel Objects. The Thirty-Fourth 768–784). DAPs: Deep Action Proposals for Action Understanding. Lecture Notes in
AAAI Conference on Artificial Intelligence, (pp. 10494-10501). New York. Computer Science.
Chen, C., Mu, S., Xiao, W., Ye, Z., Wu, L., & Ju, Q. (2019). Improving Image Captioning Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollar, P., . . . Zweig, G.
with Conditional Generative Adversarial Nets. Proceedings of the Thirty-Third AAAI (2016). From Captions to Visual Concepts and Back. arXiv:1411.4952v3.
Conference on Artificial Intelligence, (pp. 8142–8150). Hawaii, USA. Fang, Z., Gokhale, T., Banerjee, P., Baral, C., & Yang, Y. (2020). In Video2Commonsense:
Chen, C.-K., Pan, Z. F., Sun, M., & Liu, M.-Y. (2018). Unsupervised Stylish Image Generating Commonsense Descriptions to Enrich Video Captioning (pp. 840–860).
Description Generation via Domain Layer Norm. arXiv:1809.06214v1. Online: Virtual.
Chen, D., & Dolan, W. (2011). Collecting highly parallel data for paraphrase evaluation. Farhadi, A., Hejrati, M., Sadeghi, M., Young, P., Rashtchian, C., Hockenmaier, J., &
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Forsyth, D. (2010). Every picture tells a story: Generating sentences from images.
Human Language Technologies, (pp. 190-200). Portland, Oregon, USA. Proceedings of the European Conference on Computer Vision, (pp. 15-29). Crete, Greece
Chen, H., Ding, G., Lin, Z., Zhao, S., & Han, J. (2018). Show, Observe and Tell: Attribute- .
driven Attention Model for Image Captioning. Proceedings of the Twenty-Seventh Feichtenhofer, C., Pinz, A., & Wildes, R. (2017). Spatiotemporal Multiplier Networks for
International Joint Conference on Artificial Intelligence, (pp. 606-612). Video Action Recognition. IEEE Conference on Computer Vision and Pattern Recognition
Chen, H., Ding, G., Zhao, S., & Han, J. (2018). Temporal-Difference Learning With (CVPR), (pp. 4768-4777). Honolulu, Hawaii.
Sampling Baseline for Image Captioning. Thirty-Second AAAI Conference on Artificial Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). September).
Intelligence, (pp. 6706-6713). Object detection with discriminatively trained part based models. IEEE Transactions
Chen, J., & Zhuge, H. (2020). A News Image Captioning Approach Based on Multi-Modal on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Pointer-Generator Network. Concurrency and Computation Practice and Experience, Feng, Q., Wu, Y., Fan, H., Yan, C., Xu, M., & Yang, Y. (2019). August). Cascaded Revision
1–25. Network for Novel Object Captioning. arXiv:1908.02726.
Chen, J., Pan, Y., Li, Y., Yao, T., Chao, H., & Mei, T. (2019). Temporal Deformable Feng, W., Tian, F.-P., Zhang, Q., Zhang, N., Wan, L., & Sun, J. (2015). Fine-grained
Convolutional Encoder-Decoder Networks for Video Captioning. Proceedings of the change detection of misaligned scenes with varied illuminations. International
Thirty-Third AAAI Conference on Artificial Intelligence, (pp. 8167-8174). Hawaii, USA. Conference on Computer Vision (ICCV), (pp. 1260-1268). Santiago, Chile.
Chen, J., Pan, Y., Li, Y., Yao, T., Chao, H., & Mei, T. (2022). Retrieval Augmented Gan, C., Gan, Z., He, X., & Gao, J. (2017). Stylenet: Generating attractive visual captions
Convolutional Encoder-Decoder Networks for Video Captioning (pp. 1–124). Computing, with styles. IEEE Conference on Computer Vision and Pattern Recognition, (pp. 3137-
Communication and Applications: ACM Transactions on Multimedia. 3146). Honolulu, Hawaii.
Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., & Chua, T.-S. (2017). SCA-CNN: Gao, L., Guo, Z., Zhang, H., Xu, X., & Shen, H. T. (2017). July). Video Captioning with
Spatial and Channel-wise Attention in Convolutional Networks for Image Attention-based LSTM and Semantic Consistency. IEEE Transactions on Multimedia,
Captioning. arXiv:1611.05594v2. 19(9), 2045–2055.
Chen, T., Zhang, Z., You, Q., Fang, C., Wang, Z., Jin, H., & Luo, J. (2018). “Factual” or Gao, L., Wang, B., & Wang, W. (2018). Image Captioning with Scene-graph Based
“Emotional”: Stylized Image Captioning with Adaptive Learning and Attention. In Semantic Concepts. ICMLC 2018: Proceedings of the 2018 10th International
European Conference on Computer Vision (pp. 527–543). Munich, Germany: Springer. Conference on Machine Learning and Computing, (pp. 225-229). Macau, China.
Chen, X., & Zitnick, C. L. (2015). Mind’s eye: A recurrent visual representation for image Gao, Y., Hou, X., Suo, W., Sun, M., Ge, T., Jiang, Y., & Wang, P. (2022). Dual-Level
caption generation. IEEE conference on computer vision and pattern recognition, (pp. Decoupled Transformer for Video Captioning., arXiv:2205.03039v1, 1–10.
2422–2431). Boston, USA. Gella, S., Lewis, M., & Rohrbach, M. (2018). A Dataset for Telling the Stories of Social Media
Chen, Y., Wang, S., Zhang, W., & Huang, Q. (2018). Less Is More: Picking Informative Videos. Confrence on Empirical Methods in Natural Language Processing (pp. 968–974).
Frames for Video Captioning. arXiv:1803.01457. Brussels, Belgium: Association for Computational Linguistics.
Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., . . . Liu, J. (2020). Uniter: Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R.,
Universal image-text representation learning. arXiv:1909.11740, (pp. 104-120). Darrell, T., & Saneko, K. (2013). YouTube2Text: Recognizing and Describing
Cheng, Y., Huang, F., Zhou, L., Jin, C., Zhang, Y., & Zhang, T. (2017). A Hierarchical Arbitrary Activities Using Semantic. IEEE International Conference on Computer Vision
Multimodal Attention-based Neural Network for Image Captioning. Proceedings of the (ICCV), (pp. 2712-2719). Sydney, Australia.
40th International ACM SIGIR Conference on Research and Development in Information, Gueguen, L., & Hamid, R. (2015). Large-scale damage detection using satellite imagery.
(pp. 889-892). Shinjuku, Tokyo, Japan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Cho, K., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase (CVPR), (pp. 1321-1328). Boston, USA.
Representations using RNN Encoder–Decoder for Statistical Machine Translation. Guo, D., Lu, R., Chen, B., & Zeng, Z. (2021). Matching Visual Features to Hierarchical
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing Semantic Topics for Image Paragraph Captioning., arXiv:2105.04143v1.
(EMNLP), (pp. 1724-1734). Doha Qatar. Guo, L., Liu, J., Yao, P., Li, J., & Lu, H. (2019). MSCap: Multi-Style Image Captioning with
Cho, K., Courville, A., & Bengio, Y. (2015, July). Describing multimedia content using Unpaired Stylized Text. Long Beach Convention & Entertainment Center.
attention-based encoder-decoder network. IEEE Transactions on Multimedia, 17(11), Gupta, A., Verma, Y., & Jawahar, C. V. (2012). Choosing linguistics over vision to
1875–1886. describe images. Proceedings of the Twenty-Sixth AAAI Conference on Artificial
Cho, K., Merriënboer, B. V., Bahdanau, D., & Bengio, Y. (2014). On the properties of Intelligence, (pp. 606-612). Toronto, Ontario, Canada.
neural machine translation: Encoder-decoder approaches. In Association for Hakeem, A., Sheikh, Y., & Shah, M. (2004). CASEˆE: a hierarchical event representation
Computational Linguistics, (pp. 103-111). Doha, Qatar. for the analysis of videos. American Association for Artificial Intelligence , (pp. 263-
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., … Schiele, B. 268). San Jose, California.
(2016). The cityscapes dataset for semantic urban scene understanding. In IEEE Hardoon, D. R., Szedmak, S. R., & Shawe-Taylor, J. R. (2004). December). Canonical
Conference on Computer Vision Pattern Recognition (CVPR) (pp. 3213–3223). Caesars correlation analysis: An overview with application to learning methods. Neural
Palace: IEEE. Computation, 16(12), 2639–2664.
Cornia, M., Stefanini, M., Baraldi, L., & Cucchiara, R. (2020). Meshed-Memory He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial Pyramid Pooling in Deep Convolutional
Transformer for Image Captioning. arXiv:1912.08226v2. Networks for Visual Recognition., arXiv:1406.4729, 1–13.
Dai, B., Fidler, S., Urtasun, R., & Lin, D. (2017). Towards Diverse and Natural Image He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition.
Descriptions via a Conditional GAN. arXiv:1703.06029v3. NV, USA: Las Vegas.
Das, P., Srihari, R. K., & Corso, J. J. (2013). Translating related words to videos and back Heidari, M., Ghatee, M., Nickabadi, A., & Nezhad, A. P. (2020). Diverse and styled image
through latent topics. Proceedings of the sixth ACM international conference on Web captioning using SVD based mixture of recurrent experts. arXiv:2007.03338v1.
search and data mining, (pp. 485–494). Texas, USA. Hendricks, L. A., Venugopalan, S., Rohrbach, M., Mooney, R., Saenko, K., & Darrel, T.
Das, P., Xu, C., Doell, R. F., & Corso, J. J. (2013). A Thousand Frames in Just a Few (2016). In Deep compositional captioning: Describing novel object categories (pp. 1–10).
Words: Lingual Description of Videos through Latent Topics and Sparse Object Caesars Palace.
Stitching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. Herdade, S., Kappeler, A., Boakye, K., & Soares, J. (2020). Image Captioning:
2634–2641). Portland, OR, USA: IEEE. Transforming Objects into Words. arXiv:1906.05963v2.
Dave, J., & Padmavathi, S. (2022). Hierarchical Language Modeling for Dense Video Hochreiter, S., & Schmidhuber, J. (1997). December). Long short-term memory. Neural
Captioning. Inventive Computation and Information. Technologies. Computing, 9(8), 1735–1780.
Deng, J., Krause, J., A. C., & L. F.-F. (2012). Hedging your bets: Optimizing accuracy- Hodosh, M., Young, P., & Hockenmaier, J. (2013). August). Framing image description as
specificity trade-offs in large scale visual recognition. IEEE Conference on Computer a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence
Vision and Pattern Recognition, (pp. 3450-3457). Providence, RI. Research, 47, 853–899.
Deng, Z., Jiang, Z., Lan, R., Huang, W., & Luo, X. (2020, July). Image captioning using Hossain, M. Z., Sohel, F., Shiratuddin, M. F., & Laga, H. (2018, October). A
DenseNet network and adaptive attention. Signal Processing: Image Communication, Comprehensive Survey of Deep Learning for Image Captioning. arXiv:1810.04020, 1-
85(12). 36.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep Hosseinzadeh, M., & Wang, Y. (2021). Image Change Captioning by Learning from an
bidirectional transformers for language understanding. arXiv:1810.04805. Auxiliary Task. IEEE/CVF Conference on Computer Vision and Pattern Recognition
Ding, G., Chen, M., Zhao, S., Chen, H., Han, J., & Liu, Q. (2019). Neural Image Caption (CVPR), (pp. 2725-2734). Nashville, TN, USA.
Generation with Weighted Training. Cognitive Computation, 763–777.


Hu, X., Yin, X., Lin, K., Wang, L., Zhang, L., Gao, J., & Liu, Z. (2021). VIVO: Visual Kojima, A., Tamura, T., & Fukunag, K. (2002). Natural language description of human
Vocabulary Pre-Training for Novel Object Captioning. arXiv:2009.13682v2. AAAI. activities from video images based on concept hierarchy of actions. International
Huang, L., Wang, W., Chen, J., & Wei, X.-Y. (2019). In Attention on Attention for Image Journal of Computer Vision, 171–184.
Captioning (pp. 4634–4643). Coex: IEEE. Krause, J., Johnson, J., Krishna, R., & Li, F. (2016). A Hierarchical Approach for
Huang, R., Feng, W., Wang, Z., Fan, M., Wan, L., & Sun, J. (2017). Learning to detect Generating Descriptive Image Paragraphs. In IEEE Conference on Computer Vision and
fine-grained change under variant imaging conditions. International Conference on Pattern Recognition (CVPR) (pp. 3337–3345). Honolulu, HI, USA: IEEE.
Computer Vision Workshops (ICCV Workshops), (pp. 2916-2924). Venice, Italy. Krishna, R., Hata, K., Ren, F., Fei-Fei, L., & Niebles, J. C. (2017). Dense-Captioning
Huang, X., Ma, H., & Zhang, H. (2009). In A new video text extraction approach (pp. Events in Videos. IEEE International Confrence on Computer Vision, (pp. 706-715).
650–653). New York NY USA: IEEE. Venice.
Hussain, Z., Zhang, M., Zhang, X., Ye, K., Thomas, C., Agha, Z., & K. O. (2017). Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., . . . Shamma, D. A.
Automatic understanding of image and video advertisements. In Proceedings of the (2017, May). Visual genome: Connecting language and vision using crowdsourced
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Hawaiʻi dense image annotations. International Journal of Computer Vision, 123(1), 32-73.
Convention Center: IEEE. Kristan, M., Matas, J., Leonardis, A., Felsberg, M., Cehovin, L., Fernandez, G., . . .
Iashin, V., & Rahatu, E. (2020). A Better Use of Audio-Visual Cues. Dense Video Captioning Pflugfelder, R. (2015). The visual object tracking vot2015 challenge results.
with Bi-modal Transformer., arXiv:2005.08271v2, 1–22. International Conference on Computer Vision Workshops (ICCV Workshops). Santiago,