
Neural Computing and Applications

https://doi.org/10.1007/s00521-022-07790-5

ORIGINAL ARTICLE

Multimodal tweet classification in disaster response systems using transformer-based bidirectional attention model
Rani Koshy 1,2 • Sivasankar Elango 1

Received: 12 March 2022 / Accepted: 6 September 2022


© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022

Abstract
The goal of this research is to use social media to gain situational awareness in the wake of a crisis. With the developments
in information and communication technologies, social media became the de facto norm for gathering and disseminating
information. We present a method for classifying informative tweets from the massive volume of user tweets on social
media. Once the informative tweets have been found, emergency responders can use them to gain situational awareness so
that recovery actions can be carried out efficiently. The majority of previous research has focused on either text data or
images in tweets. A thorough review of the literature illustrates that text and image carry complementary information. The
proposed method is a deep learning framework which utilizes multiple input modalities, specifically text and image from a
user-generated tweet. We mainly focused on devising an improved multimodal fusion strategy. The proposed system has
transformer-based image and text models. The main building blocks include a fine-tuned RoBERTa model for text, a Vision
Transformer model for images, BiLSTM and an attention mechanism. We put forward a multiplicative fusion strategy for
the image and text inputs. Extensive experiments have been done on various network architectures with seven datasets
spanning different types of disasters, including wildfire, hurricane, earthquake and flood. Several state-of-the-art
approaches were surpassed by our system, which showed good accuracy in the range of 94–98%. The results showed that
identifying the interaction between multiple related modalities will enhance the quality of a deep learning classifier.

Keywords Disaster tweet classification · Multimodal data fusion · BiLSTM · Attention · RoBERTa · Vision transformer

Rani Koshy: 406320002@nitt.edu
Sivasankar Elango: sivasankar@nitt.edu

1 Department of Computer Science and Engineering, National Institute of Technology, Trichy, Tamilnadu 620015, India
2 Department of Computer Science and Engineering, College of Engineering, Trivandrum, Kerala 695016, India

1 Introduction

Disasters can happen any time, anywhere. They cause massive destruction and are sometimes uncontrollable. Human-made disasters, such as industrial explosions, armed conflicts and riots, are caused knowingly or unknowingly by humans. Natural disasters, like earthquakes, droughts etc., occur due to natural phenomena. In whatever form a disaster takes, the lives and livelihoods of people, the environment and the economy are affected in proportion to its intensity. The unpredictability of the emergency makes things worse. This is the scenario where disaster response systems play a vital role. Quick response and rehabilitation are the key to the success of a disaster management system. Delayed service means no service.

Crisis informatics or disaster informatics [36] is an area of study exploiting information and technology in different phases of disasters and other emergencies. It focuses on interconnecting people, technology, organizations and information during crises/disasters. It views the citizenry as a powerful, collectively intelligent and self-organizing force that can play a transformational and indispensable role in emergency management.

In disaster informatics, data is the driving force. Social media sites have grown in popularity as a result of advances in information and communication technology. They can be considered a significant source of real-time information. In today's world, the first to report on an issue is always social media [30].

Physical sensors, human sensors and social media users are all data sources for social media analysis [37]. In the proposed system, we focus on data supplied by social media users, particularly the general public. Early warning and event detection, situational awareness, resource collection and dissemination, and post-disaster analysis are some ways that social media can be used in disaster informatics.

In the proposed work, we concentrate on getting situational awareness from the available data. For this, we have to identify whether the social media content is informative or not informative. Situational awareness is when you are aware of what is going on around you [11] or, in a technical manner, "the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning and the projection of their status in the near future" [10].

Social media analysis faces numerous obstacles in this era of data deluge. Social media is overflowing with noisy data. When a crisis occurs, social media becomes overburdened with messages, making it impossible for rescuers to keep track of them. It is challenging to pick out messages that require immediate attention from the massive volume of texts. The nonstandard terminology and brevity of social media material also make it difficult to find important information. In addition to textual material, users can contribute images and videos from the crisis region, which can improve situational awareness. The studies [2, 40–42, 50, 51] say that single-modality analysis is not sufficient for getting good results.

In the past, the majority of social media research focused on text-only or image-only data. However, when image and text data are combined, they reveal complementary information. Multimodal data analysis is an active open research area. Using various modalities provides more contextual information, allowing more robust learning. Thus it necessitates an information processing system which can automatically identify disaster-relevant tweets by considering both text and image. Figure 1 displays some tweets from recent emergency situations that include text and related images.

This research aims to develop a classification system that considers both the text and the associated images to determine whether a tweet is informative or not. Identifying disaster-relevant information is an essential requirement for authorities to deliver efficient and on-time recovery operations.

Multimodality can be dealt with in a variety of ways, including integrating various models at the output level (referred to as decision-level or late fusion) or at the feature level (referred to as feature-level or early fusion) [14]. Late fusion: fusion takes place at the decision-making level; for each input modality there is a different classification model, and the individual models' decisions are merged in some way to arrive at a final judgement. Early fusion: fusion takes place at the feature level; different features are merged in some manner and subsequently processed. The toy sketch below makes the distinction concrete.
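The following toy sketch contrasts the two fusion styles; the feature vectors and the classifier head are illustrative stand-ins, not the models used in this paper.

```python
# Toy contrast of late vs. early fusion; features and the classifier head
# are illustrative stand-ins, not this paper's models.
import numpy as np

rng = np.random.default_rng(0)
text_feat = rng.normal(size=8)       # features from a text model
image_feat = rng.normal(size=8)      # features from an image model

def classify(features):              # placeholder classifier head
    return 1 / (1 + np.exp(-features.mean()))

# Late (decision-level) fusion: classify each modality, then merge decisions.
late_score = 0.5 * classify(text_feat) + 0.5 * classify(image_feat)

# Early (feature-level) fusion: merge the features first, classify once.
early_score = classify(np.concatenate([text_feat, image_feat]))

print(f"late={late_score:.3f}  early={early_score:.3f}")
```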
We explored a variety of text and image models, as well as experiments, to better understand how text and image features interact. For text and image processing, we used cutting-edge transformer-based models. To identify the interaction between different input modalities, we put forward an early fusion strategy.

We performed extensive experiments with CrisisMMD [2], a real-world dataset collected from Twitter during seven major natural disasters that occurred in different parts of the world. The experimental setup is the same for all the experiments to get comparable results. We tested our system in two scenarios to confirm its viability in real-world use. In-domain classification: fragments of the same dataset are used for both training and testing. Cross-domain classification: the classifier is tested on a different dataset after being trained on another. The results showed that the proposed approach is more robust than baseline models and single-modality systems.

Even though the proposed approach showed good results, there are still chances for improvement. With greater hardware resources, we can expand the system's capacity. Deep learning architectures need a lot of resources, including greater memory and computing power. The experiments are constrained by the resources available.

The proposed system helps to improve situational awareness for disaster response and recovery operations, which benefits people's quality of life. Our findings are useful to research disciplines that require diverse inputs from several modalities, such as fake news detection, question-answering systems and so on.

The contributions of this research are listed below:

• An extensive comparative analysis of various textual, visual and multimodal systems for the classification task in in-domain and cross-domain scenarios is performed.
• Transformer-based image and text processing models are utilized for multimodal fusion systems.
• A robust deep learning neural network architecture is proposed that has a novel multimodal feature fusion layer and some of the most recent deep learning techniques, such as BiLSTM, the ViT model and the RoBERTa model, as the building blocks. The proposed model can exploit the hidden interaction between multiple modalities for the automatic detection of informative tweets for a disaster response system. Later, the tweet can be used for disaster recovery and mitigation.

Fig. 1 Users’ post during disasters

The arrangement of the paper is as follows: Sect. 2 gives a quick summary of some of the most noteworthy disaster-related works. The necessary background details for building up the multimodal fusion system are elaborated in Sect. 3. Section 4 illustrates the approach and architecture of the proposed system. Section 5 lays forth the experimental setup and assesses the outcomes. Finally, Sect. 6 gives the conclusion and prospects of the system.

2 Related works

In this section, we discuss previous research initiatives that are relevant to our work. There are text-only analyses, image-only analyses and multimodal analyses. In some multimodal systems, inputs from multiple sources have been taken, e.g., different sensors (temperature sensors, humidity sensors), data from weather departments (wind speed, rainfall level) etc. In the proposed approach, we take multiple inputs of different modalities from the tweet. This provides the advantage of quick and immediate data availability in the event of a calamity. There is no need to use data from other sources. Several studies have shown the utility of Twitter text and images in disaster recovery operations.

2.1 Text-based classification

Sreenivasulu et al. [25] presented a work with a random forest classifier for identifying damage-related tweets. They utilized lexical features, syntactic features and the frequency of words related to damage assessment. Linear regression and support vector regression are used to weigh different features. They achieved accuracy up to 94% on datasets of earthquakes in Italy and Chile and floods in India. In Madichetty et al. [29] a stacked Convolutional Neural Network (CNN) is proposed for recognising tweets conveying resource requirement and availability. Crisis word embedding is used here. They concatenated the output of a K-nearest neighbour (KNN) classifier and a CNN classifier. The concatenated output is finally classified by a support vector machine (SVM) classifier. They achieved accuracy in the range 67–77% on datasets for different disasters. Snyder et al. [45] put forward an interactive learning

framework for situational awareness classification. The user can iteratively interact with the system: whenever the classification goes wrong, the user corrects the classifier. Only the tweet text is considered here. As the building blocks, they employed Word2Vec embedding, CNN and Long Short Term Memory (LSTM). This system is integrated with the SMART 2.0 toolkit.

To aid victims who require medical assistance, Sreenivasulu et al. designed a majority voting-based ensemble classifier which can identify medical resource tweets [28]. They achieved 82.4% accuracy. Zahra et al. [52] used a random classifier with linguistic characteristics to identify eye-witness messages. They defined features exclusive to a direct eye-witness, such as words indicating perceptual senses, first-person pronouns and adjectives.

In Ghafarian et al. [13] an approach based on a linguistic concept known as the 'distributional hypothesis' is proposed. A tweet is modelled as a distribution of words, and an SVM classifier predicts a label for a distribution. They showed that the idea is superior to the bag-of-words (BOW) model. They tested their system with several datasets and achieved an accuracy of 74–80%. Kejriwal et al. [19] proposed a system to detect disaster-related urgent messages using a minimally-supervised approach. They trained the system with labelled and unlabelled tweets. This approach is suitable for adapting to a new crisis.

2.2 Image-based classification

Alam et al. [1] developed Image4Act, a deep neural network framework for image classification. They utilized social media images posted during a disaster to get situational awareness. They fine-tuned VGG16 and achieved 67% accuracy on relevancy classification.

Alam et al. [3] implemented image classification for damage level assessment. The images from disaster areas will be very messy and difficult to understand, even for humans. They implemented real-time image capturing, deduplication and information extraction. For image classification and deduplication, they used the VGG16 model and the perceptual hashing technique, respectively. Their results showed that by utilizing social media imagery, emergency responders can deliver relief efforts effectively.

Chaudhuri et al. [7] developed a CNN model to classify images containing human body parts amid debris. With this system, the emergency responders can get information about trapped survivors. Their system was suited for a smart city environment. They achieved an accuracy of 83.2%.

Valdez and Godmalin [49] proposed a lightweight CNN with two classification heads for identifying the type and intensity level of a natural calamity. They used a fine-tuned MobileNetV3 with a feed forward neural network (FFN) added for classification. They developed a dataset for natural disasters (wildfire, flood, earthquake and volcanic eruption) and disaster intensity levels (severe, moderate and insignificant). They achieved an accuracy of 96.8% for disaster type and 93.2% for intensity level.

Kyrkou and Theocharides [21] developed ERnet, a computationally efficient CNN model with residual connections for aerial image classification, suitable for unmanned aerial vehicles (UAV). They introduced an aerial image dataset including images of fire, flooding, ruined buildings and traffic accidents. They achieved an accuracy of 90.1% with a reduced memory requirement.

2.3 Multimodal approach

Rizk et al. [39] developed a two-stage multimodal framework suitable for energy-constrained devices to process social media tweet texts and images. Level 1 classifiers process image and text; these classifiers' decisions are combined and used to train the level 2 classifier. They attained an accuracy of 92.43%.

Mouzannar et al. [32] developed a multimodal deep learning model to identify damage-related information. For image processing, a pretrained Inception model is used, and for text processing, a CNN-based neural network is used. Finally, the text features and image features are combined and given to the FFN classifier. They achieved an accuracy of 92.62%.

Mohanty et al. [31] conducted a case study of Hurricane Irma. For a relevancy classifier, they proposed a multimodal technique. They used four different classifiers: a classifier to identify tweets with geospatial attributes, an image classification model, a user authenticity classification model and a text classification model to find tweets that were just about Hurricane Irma. The results of the four models are combined: if the score exceeds a predetermined level, the social media post is classified as disaster relevant. They used a decision-level fusion method.

Another multimodal approach was proposed by Kumar et al. [20] to classify disaster-related informative content. The feature vectors generated by LSTM and VGG16 are concatenated and further passed through an FFN. They achieved F1-scores ranging from 0.61 to 0.92 for various datasets.

Madichetty et al. [26] proposed a multimodal approach utilizing the BERT language model and the DenseNet image model to analyze tweet text and the associated image. They implemented a late fusion approach: the output probability vectors from the text model and image model are averaged, and this value is taken for prediction.

Ofli et al. [34] proposed a deep learning neural network that realizes a multimodal approach with image

classification using VGG16 and text classification using word2vec and CNN. The image feature vector and text feature vector are concatenated and given to an FFN for final classification. They achieved an accuracy of 78.4%.

Existing multimodal fusion approaches are not promising for handling complex multimodal and high-dimensional data. This is due to the fact that:

• Existing systems consider only a single modality input and learn the pattern in that modality. They cannot identify relationships across different input modalities.
• The existing systems are unable to prioritise different features in the order of importance. They value all features equally.

As a solution, we require a framework for classification techniques based on feature fusion that identifies crucial latent associations in input data from different modalities. It must place varying degrees of emphasis on various elements (e.g., different words, different regions of images, etc.) depending on their importance.

3 Preliminaries

This section provides an overview of the key components of the proposed system. The proposed system constitutes a text preprocessing module, the RoBERTa (Robustly Optimized BERT Pretraining Approach) text model, the ViT (Vision Transformer) image model, BiLSTM (Bi-directional Long Short Term Memory) and an attention module.

3.1 Text preprocessing

Text is an unstructured form of data. Text preprocessing cleans up the raw data, making it more precise and manageable for the text processing system [15]. Typically, a social media tweet is informal and contains much redundant information. To mitigate these issues, preprocessing is done. From the extensive experiments we had done, we found that preprocessed text gives better performance [44]. Common text preprocessing methods include (a sketch follows the list):

• Expanding contractions
• Changing the case of the letters to lower case
• Removing punctuation
• Removing stopwords
• Removing URLs
• Stemming and lemmatization
• Removing extra spaces
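A minimal Python sketch of such a pipeline is shown below, assuming the NLTK stopword list has been downloaded; the contraction map is a small illustrative sample, and stemming/lemmatization is omitted for brevity.

```python
# A minimal sketch (not the authors' exact pipeline) of common tweet
# preprocessing steps; requires `nltk.download("stopwords")` once.
import re
from nltk.corpus import stopwords

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOPWORDS = set(stopwords.words("english"))

def preprocess(tweet: str) -> str:
    text = tweet.lower()                                  # lower-casing
    for c, full in CONTRACTIONS.items():                  # expand contractions
        text = text.replace(c, full)
    text = re.sub(r"https?://\S+", " ", text)             # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)                 # remove punctuation/symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)                               # collapses extra spaces

print(preprocess("Flood waters can't recede, see https://t.co/xyz !!!"))
# -> something like: "flood waters cannot recede see"
```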
3.2 Word embedding

Word embedding is a distributed, learned representation for text in the form of real-valued vectors. Words having the same meaning will have representations that are more or less similar. This is one of the key breakthroughs in deep learning techniques for Natural Language Processing (NLP).

3.2.1 RoBERTa

BERT (Bi-directional Encoder Representations from Transformers), introduced by Google Brain in 2018 [8], is a milestone in the realm of NLP. It is pretrained on BookCorpus and Wikipedia, and it outperformed several NLP systems. The basic building block of BERT is the transformer, which is an improvement over traditional encoder-decoder systems. Later, several improvements over BERT in terms of performance and training speed were proposed, such as XLNet, RoBERTa, DistilBERT etc. RoBERTa, introduced by Facebook, outperformed BERT in over 20 NLP tasks on the GLUE benchmark datasets.

In the proposed approach, the RoBERTa [22] model is used to generate word embeddings. We experimented with several word embedding techniques, and the results are shown in Table 2. RoBERTa showed promising results. RoBERTa is a transformer-based NLP model; it generates dynamic word embeddings. A short sketch of extracting such embeddings is given below.
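As an illustration, the snippet below shows one way, via the HuggingFace transformers library (an assumption, not the authors' released code), to obtain per-token RoBERTa embeddings for a tweet; the roberta-base checkpoint and the sequence length of 60 mirror the setup described in Sects. 4.2.1 and 5.2.

```python
# A minimal sketch of extracting contextual RoBERTa token embeddings with
# the HuggingFace `transformers` library; checkpoint and length are assumptions.
import tensorflow as tf
from transformers import RobertaTokenizer, TFRobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = TFRobertaModel.from_pretrained("roberta-base")

tweet = "flood water rising near the main bridge, families evacuating"
enc = tokenizer(tweet, padding="max_length", truncation=True,
                max_length=60, return_tensors="tf")

out = roberta(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
embeddings = out.last_hidden_state   # shape (1, 60, 768): one 768-d vector per token
print(embeddings.shape)
```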
3.3 Fine-tuned RoBERTa

In a real-world scenario, it is difficult to hold the assumption that the training dataset and the test dataset follow the same distribution and have the same statistical properties. This constraint can be overcome by transfer learning. In [6] transfer learning is described as a technique to transfer and apply the knowledge acquired from one task to perform another task.

We can employ a pretrained model for a related but different task through fine-tuning. The model starts with the learned weights of the pretrained model and is later trained to learn the task-specific features. Hence, the system gets a good starting point rather than a randomized state. Fine-tuning is particularly useful where the labelled dataset is scarce. The benefits of fine-tuning are:

• Overfitting is less likely to occur.
• The model converges faster, with reduced computational resources and time.

The advantage of fine-tuning comes from the fact that the pretrained model has a deep contextual understanding. With fine-tuning, the knowledge acquired by the

pretrained model can be considered for solving the current problem.

In [48], the authors used the technique of fine-tuning to utilize the ALBERT language model for cyberbullying analysis of social media data. Their system achieved state-of-the-art results with an F1-score of 95%. In [33], the authors used a fine-tuned BERT model for sentiment analysis of Indonesian user reviews about mobile apps. Their experiments showed overfitting behaviour when the model was trained from scratch. They got state-of-the-art results with Indo-BERT-Base, a BERT variant pretrained on the Indonesian language. In [35] the authors implemented an NLP system for multilingual grammatical error correction. The results showed that fine-tuning produces better models with a relatively smaller dataset and reduced computational power consumption.

In our use case, a properly labelled dataset is lacking at the onset of a disaster. We used the pretrained RoBERTa model, which is later fine-tuned on our task-specific dataset.

The key characteristics of RoBERTa are:

• It generates contextualized word embeddings.
• It takes into account the bidirectional transformer concept, so it can utilize past and future contexts.
• It is pretrained with 10 times more data and 8 times larger batches than BERT.
• Byte-pair encoding is used for tokenization, so it can recognize rare words and out-of-vocabulary tokens.
• In BERT, the randomly chosen 15% of masked tokens used in pretraining is fixed for the entire process. In RoBERTa, masks are dynamically changed during training.

3.4 BiLSTM

The basic building block of Bi-directional Long Short Term Memory (BiLSTM) is Long Short Term Memory (LSTM). LSTM was introduced as a replacement for the Recurrent Neural Network (RNN) to solve the problem of the vanishing gradient.

LSTM has three blocks: a forget gate, an input gate and an output gate [17]. Other than the addition of the gating mechanism, the concept of LSTM is similar to RNN. The gating mechanism helps to remove irrelevant parts and retain relevant parts from the previous states. Consequently, LSTM is able to beat the vanishing gradient (i.e. evading information) problem. The forget gate determines what information is to be removed and what is to be retained from the previous states, the input gate determines what information is to be taken into consideration from the current inputs, and the output gate determines what information is to be passed to the next time step. An LSTM cell has two states: a hidden state and a cell state. All of this is manifested with the help of sigmoid and tanh functions.

Figure 2 shows a high-level architecture of LSTM. Equations 1 to 7 show the mathematical formulae for the forget gate, input gate and output gate. $f_t$, $i_t$ and $o_t$ represent the forget gate, input gate and output gate at time step $t$. $W_f$, $W_i$, $W_o$, $b_f$, $b_i$ and $b_o$ indicate the weight and bias parameters of the forget gate, input gate and output gate of a single cell in LSTM. $\sigma$, $\odot$, $+$ and $\parallel$ denote the sigmoid function, element-wise multiplication, element-wise addition and the concatenation operation, respectively. $x_t$ denotes the input data at time $t$, $h_t$ is the hidden state at time $t$, and $\tilde{C}_t$ and $C_t$ represent the candidate cell state and the final cell state, respectively, at time $t$.

$I = h_{t-1} \parallel x_t$   (1)
$f_t = \sigma(W_f \cdot I + b_f)$   (2)
$i_t = \sigma(W_i \cdot I + b_i)$   (3)
$\tilde{C}_t = \tanh(W_c \cdot I + b_c)$   (4)
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$   (5)
$o_t = \sigma(W_o \cdot I + b_o)$   (6)
$h_t = o_t \odot \tanh(C_t)$   (7)

BiLSTM is a sequence processing neural network model. It is an improved variant of RNN and LSTM, and it is capable of solving the vanishing and exploding gradient problems. BiLSTM is made up of two LSTMs: one LSTM processes text in the forward direction (from left to right), while the other processes text in the backward direction (from right to left). A high-level view of BiLSTM is shown in Fig. 3, where $w_t$ is the word embedding of the $t$th word, and $wf_t$ and $wb_t$ are the representations of $w_t$ by the forward and backward LSTMs, respectively.

In the forward direction, the output of the forward LSTM is determined by the current input word vector and the preceding hidden vector. Similarly, in the backward direction, the output of the backward LSTM is determined by the current input word vector and the prior hidden vector. The outputs of the forward and backward LSTMs are then concatenated to produce the final feature vector. As a result, the available contextual information is improved. This architecture materializes the idea that the meaning of a word is determined by the words that come before and after it. Hence, BiLSTM is thought to generate a feature representation that captures more information, which leads to improved learning. A NumPy sketch of a single LSTM step following Eqs. 1–7 is given below.
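```python
# A minimal NumPy sketch of one LSTM step following Eqs. 1-7; the
# dimensions and random initialization are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """Each W[k] maps the concatenated [h_{t-1} || x_t] to one gate."""
    I = np.concatenate([h_prev, x_t])            # Eq. 1: concatenation
    f_t = sigmoid(W["f"] @ I + b["f"])           # Eq. 2: forget gate
    i_t = sigmoid(W["i"] @ I + b["i"])           # Eq. 3: input gate
    c_hat = np.tanh(W["c"] @ I + b["c"])         # Eq. 4: candidate cell state
    c_t = f_t * c_prev + i_t * c_hat             # Eq. 5: new cell state
    o_t = sigmoid(W["o"] @ I + b["o"])           # Eq. 6: output gate
    h_t = o_t * np.tanh(c_t)                     # Eq. 7: new hidden state
    return h_t, c_t

d_in, d_h = 768, 128          # e.g. a 768-d word vector feeding 128 hidden units
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
print(h.shape)                # (128,)
```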

Fig. 2 LSTM

Fig. 3 BiLSTM

3.5 Image models

Several studies have shown that utilizing tweet text along with the associated images produces better results [20, 26, 31, 32, 34, 39]. We can think of the image as supplementary data to the information in the tweet text. Accordingly, if we can extract image features efficiently, the deep learning system's capability will be increased. It will boost the amount of contextual data available.

We need image features that meaningfully and precisely represent the information encompassed in the image. The Convolutional Neural Network (CNN) is the most popular deep learning algorithm for image feature extraction. CNN is predicated on the premise that image understanding at a local level is sufficient. A CNN comprises convolutional layers and pooling layers. Convolutional layers are capable of extracting complicated features that accurately characterise an image. Downsampling is accomplished via pooling layers. Thereby, CNN is resistant to scale, rotation, distortion and occlusion.

ImageNet, CIFAR and MNIST are some of the prominent image datasets publicly available. Using these datasets, researchers developed pretrained image models which showed better performance; VGG16, ResNet152, InceptionV3, EfficientNet and DenseNet are some of them.

CNNs' inability to understand how the features relate to one another is their pitfall. To capture long-range dependencies, large convolutional kernels are required; however, the computing efficiency then decreases. Later, the concept of the transformer became popular among researchers, and it proved effective in retaining long-term dependencies and in parallel processing. The transformer-based models are built around the attention mechanism. With the attention layer, global data is accessible to the transformer to generate the token embeddings. Section 3.6 discusses the importance of the attention mechanism.

3.5.1 Vision transformer

Vision Transformer, or ViT, is an image processing model with a transformer-like architecture [9]. It replaces convolutions with the transformer. The multi-head self-attention layer is the key component in the transformer module. In addition, the transformer has a multi-layer perceptron with a GELU activation function and normalization. Some residual connections are also incorporated to improve long-range dependency; the residual connections help the free flow of data bypass the non-linear activation functions. Figure 4 illustrates the architecture of the Vision Transformer as depicted by Dosovitskiy et al. [9].

The transformer is a sequence processing algorithm. So, the image is split into fixed-size patches, and these patches are flattened to vectors. As a result, the image is transformed into a sequence of tokens. These tokens, along with positional embeddings, are input to the transformer encoder. CNN models use a pixel array, whereas ViT uses a series of visual tokens.

3.6 Attention mechanism

The attention mechanism [4] is one of the most amazing breakthroughs in both computer vision and NLP in the past decade. It is the cornerstone of all transformer-based systems, such as BERT, XLNet, ViT, CoAtNet etc.

As the term indicates, the attention mechanism makes the model concentrate on the most relevant parts of the available data while ignoring the other parts. It mimics one of the most important cognitive processes of the human brain. It was introduced in the paper [4] by Bahdanau et al. in 2014. Initially, it was used for neural machine translation, but later its strength in other deep learning models was also successfully proved.

The attention concept is materialized by a feed forward neural network (FFN). The basic idea behind the attention mechanism is to give varying weights to different parts of the data. The FFN generates an attention score for each of the encoded input vectors, with the most relevant vectors being attributed the highest attention weights. The attention score indicates the relevance of each input element with respect to the task. This score is then normalized by a softmax operation. Later, the context vector of the input data is formed by combining (usually by summation) the attention-weighted input vectors. The following equations show the entire process:

$a_t = \frac{\exp(v \cdot h_t)}{\sum_{t} \exp(v \cdot h_t)}$   (8)

$S = \sum_{t} a_t \cdot h_t$   (9)

where $h_t$ is the hidden vector of the input word at time $t$, $a_t$ is the normalized attention weight for $h_t$, and $S$ is the attention-weighted context vector for the sentence. $v$ is a trainable parameter which is randomly initialized and jointly learned by the system to identify the most attentive word. Over time, several variants of the attention mechanism, differing in score calculation and context vector generation, came out. A compact NumPy sketch of Eqs. 8 and 9 is given below.
the recovery operations.
being attributed the highest attention weights. The attention
We need a multimodal joint representation, /ðx; yÞ, that
score indicates the relevance of each input element with
robustly exploits the interaction between textual features
respect to the task. This score is then normalized by a
and image features. Using this collective representation,
softmax operation. Later, the context vector of the input
equation 10 can be learned efficiently.
data is formed by combining (usually summation) the
A high-level view of the overall system architecture is
attention-weighted input vectors. The following equations
depicted in the Fig. 5.
show the entire process.
Major modules in the proposed system are:
expðv  ht Þ
at ¼ P ð8Þ • Text Feature Extraction Module: extracts the informa-
t expðv  ht Þ
X tion from the tweet text
S¼ at :ht ð9Þ • Image Feature Extraction Module: extracts image
t features from the associated image
• Fusion Module: combines the features given by the
feature extraction modules. In the proposed system, we
ht : Hidden vector of input word at time t used the early fusion technique.
at : Normalized attention weight for ht • Classification Module: Finally, a feed forward neural
S : Attention-weighted context vector for the sentence network (FFN) is employed for the classification part.

v is a trainable parameter which is randomly initialized Figure 6 shows the detailed architecture of the proposed
and jointly learned by the system to identify the most system.
attentive word. Over time, several variants of attention The proposed system achieved superior performance
mechanism according to the score calculation and context over state-of-the-art systems.
vector generation came out.

123
Neural Computing and Applications

Fig. 4 Vision transformer

Fig. 5 Overall system architecture

4.2 Feature extraction modules

This section describes the methodology used for extracting features from the different input modalities. We built two feature extraction modules: one for text and one for images. Later, a fusion module is used to get the joint representation of the extracted features. To obtain the final classification result, the fusion module's output is routed through an FFN. We used a fine-tuned ViT model for visual feature extraction and the RoBERTa model for text feature extraction.

4.2.1 Textual feature extraction: BiLSTM neural network with fine-tuned RoBERTa

The text features are extracted using a BiLSTM neural network together with the fine-tuned RoBERTa model. The tweet text undergoes a series of transformations to get fine-grained features. As the first step, text preprocessing is done. We have used text normalization procedures, including:

• removal of duplicate tweets, stopwords, punctuation, symbols, retweet tags, HTML tags, URLs and emoticons;
• conversion to lower-case letters;
• replacement of contractions: typically, some word contractions are used in social media tweets; we made an attempt to replace these contractions with the actual words.

The preprocessed text then undergoes the word embedding process.

1. Word embedding by RoBERTa model

In the proposed approach, word embedding is achieved with fine-tuned RoBERTa:

$F_t = \langle f_{t1}, f_{t2}, \ldots, f_{tn} \rangle$

$F_t$ represents the tweet text vector after RoBERTa embedding, where $n$ denotes the number of words in a tweet text. We used $n = 60$, since in the time of a crisis people usually send brief tweets. Each $f_{ti}$ is a vector of 768 dimensions, $f_{ti} \in \mathbb{R}^{d_w}$, where $d_w$ denotes the dimension of the word vector, which is of size 768, generated by RoBERTa. The RoBERTa embeddings are processed further using BiLSTM.

4.2.2 Visual feature extraction

In the proposed approach, we are using the Vision Transformer (ViT) [9], a transformer-based image classification model. The authors of [9] claim that ViT achieved 4 times better results when compared with state-of-the-art CNN models. We have experimented with CNN models and ViT; the results are shown in Table 3. ViT delivered a good performance.

In the proposed approach, we used the ViT_b32 model. It features 12 layers of transformer encoders, and thus 12 attention heads. The transformer can take only 1d input. Therefore, the image, $X$, is converted into a sequence of flattened 2d patches, $X_p$ [9]:

$X \in \mathbb{R}^{H \times W \times C}$   (11)

$X_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$   (12)

$N = \frac{HW}{P^2}$   (13)

where $(H, W)$ is the resolution of the image, $C$ is the number of channels, $(P, P)$ is the resolution of a patch and $N$ is the number of patches.

Let $D$ be the transformer hidden vector dimension. The 2d image patches, $X_p$, are converted to patch embedding vectors of dimension $D$, in $\mathbb{R}^{N \times D}$. Similar to BERT, a trainable class token, CLS, is prepended to the embedded patches; hence, there are $N+1$ patches. In addition, since the transformer does not have any indication of token position, positional embeddings, $E_{pos} \in \mathbb{R}^{(N+1) \times D}$, are also attached to the patch embeddings. The hidden vector size in ViT_b32 is 768. [9] can be referred to for further details on the ViT model.

In the proposed system, we take the output of the penultimate layer of the ViT. Since this is a classification task, we capture only the output vector of the CLS token, i.e. the vector of size 768 corresponding to the extra learnable class embedding entity. The final embedded vector $I \in \mathbb{R}^{768}$ is passed to the multimodal interaction layer. A short sketch of the patch-sequence computation of Eqs. 11–13 is given below.
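```python
# A minimal TensorFlow sketch of the patch-sequence computation of
# Eqs. 11-13 for the ViT_b32 setting (224x224x3 image, 32x32 patches);
# the transformer encoder itself is omitted.
import tensorflow as tf

H = W = 224                     # image resolution
C = 3                           # channels
P = 32                          # patch resolution
N = (H * W) // (P * P)          # Eq. 13: number of patches -> 49

image = tf.random.uniform((1, H, W, C))              # X in R^{HxWxC} (Eq. 11)
patches = tf.image.extract_patches(
    images=image,
    sizes=[1, P, P, 1], strides=[1, P, P, 1],
    rates=[1, 1, 1, 1], padding="VALID")
patches = tf.reshape(patches, (1, N, P * P * C))     # X_p in R^{Nx(P^2*C)} (Eq. 12)
print(patches.shape)                                 # (1, 49, 3072)
# A linear projection to D=768, a prepended CLS token and positional
# embeddings would complete the (N+1) x D transformer input.
```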
Fig. 6 Proposed system architecture

4.3 Multimodal fusion module

Multimodal fusion is the process of combining data from many input modalities into a single unit. It exploits the complementarity of heterogeneous inputs. We propose a multiplicative fusion approach with a BiLSTM followed by an attention mechanism.

4.3.1 Multimodal interaction layer

This layer combines the inputs of multiple modalities. In the proposed system, early fusion with multiplicative merging is done. Early fusion can be considered as generating fine-quality features. Both feature extraction modules generate feature vectors of dimension 768. To reduce the dimensionality of the feature vectors, a nonlinear mapping is done by a feed forward neural network. Then merging is done by multiplying each

pair of elements from both feature vectors. Equations 14–16 show the fusion operations implemented:

$T' = f(T \cdot W_0 + b_0)$   (14)

$I' = f(I \cdot W_1 + b_1)$   (15)

$TI = T' \otimes I'$   (16)

with $T, I \in \mathbb{R}^{768}$, $T', I' \in \mathbb{R}^{u}$, $TI \in \mathbb{R}^{u \times u}$, $W_0, W_1 \in \mathbb{R}^{u \times 768}$ and $b_0, b_1 \in \mathbb{R}^{u}$, where $T$ is the tweet text vector, $I$ is the image vector, $u$ is the dimension of the nonlinearly-mapped feature vectors, $T'$ and $I'$ are the mapped text and image feature vectors, respectively, and $\otimes$ denotes the outer product. $W_0$, $W_1$, $b_0$ and $b_1$ are weight and bias parameters.

The weight matrices and bias vectors are jointly learned in the network. Later, $TI$ is passed to a BiLSTM with an attention layer. A sketch of this outer-product merging is given below.
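The following is a minimal TensorFlow sketch of the merging of Eqs. 14–16; the choice u = 64 and the tanh mapping are illustrative assumptions, since the paper does not fix them here.

```python
# A minimal TensorFlow sketch of the multiplicative (outer-product) fusion
# of Eqs. 14-16; u = 64 and the tanh nonlinearity are assumptions.
import tensorflow as tf

u = 64
text_proj = tf.keras.layers.Dense(u, activation="tanh")   # Eq. 14: T' = f(T W0 + b0)
image_proj = tf.keras.layers.Dense(u, activation="tanh")  # Eq. 15: I' = f(I W1 + b1)

T = tf.random.uniform((1, 768))          # RoBERTa-side feature vector
I = tf.random.uniform((1, 768))          # ViT-side feature vector

T_p = text_proj(T)                       # (1, u)
I_p = image_proj(I)                      # (1, u)
TI = tf.einsum("bi,bj->bij", T_p, I_p)   # Eq. 16: outer product, (1, u, u)
print(TI.shape)                          # every text feature times every image feature
# The u x u interaction map is then treated as a length-u sequence of
# u-dimensional vectors and fed to the BiLSTM-with-attention block.
```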

4.3.2 Attention layer

In the proposed approach, we use the idea of attention put forward by Bahdanau [4]. It is robust and globally draws the relative importance of the input words with respect to the output word. We apply BiLSTM with attention to the fused text and image vector to extract the multimodal contextual vector.

Let $M = \langle m_1, m_2, \ldots, m_n \rangle$ be the BiLSTM output for the fused feature vector. The alignment score $A = \langle a_1, a_2, \ldots, a_n \rangle$ for $M$ is calculated by applying the tanh function with a feed forward neural network, as in Eq. 17. $W$ indicates the weight parameters and $b$ the bias parameters; the network jointly learns $W$ and $b$. The $\alpha$ values in Eq. 18 symbolize the importance of each input feature. The final feature vector, $C$, is then calculated as the weighted sum of all feature elements, as in Eq. 19:

$a_i = \tanh(W^{T} m_i + b)$   (17)

$\alpha_i = \frac{\exp(a_i)}{\sum_{j=1}^{n} \exp(a_j)}$   (18)

$C = \sum_{i=1}^{n} \alpha_i \cdot m_i$   (19)
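One way to realize this block in Keras is sketched below (an assumption of the wiring, not the authors' exact code): a custom layer scores each BiLSTM output (Eq. 17), normalizes the scores (Eq. 18) and returns the weighted sum (Eq. 19).

```python
# A minimal Keras sketch of BiLSTM followed by the attention of Eqs. 17-19;
# the dimensions are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

class AttentionPool(layers.Layer):
    """Bahdanau-style pooling implementing Eqs. 17-19."""
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(d, 1), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(1,), initializer="zeros")
    def call(self, M):
        a = tf.tanh(tf.matmul(M, self.W) + self.b)   # Eq. 17: alignment scores
        alpha = tf.nn.softmax(a, axis=1)             # Eq. 18: normalized weights
        return tf.reduce_sum(alpha * M, axis=1)      # Eq. 19: context vector C

u = 64                                               # fused map: u vectors of size u
fused = tf.keras.Input(shape=(u, u))
M = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(fused)
C = AttentionPool()(M)                               # shape (batch, 256)
block = tf.keras.Model(fused, C)
block.summary()
```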

Now, $C$ represents the fused, contextual, fine-grained feature vector for the tweet. Figure 7 shows the BiLSTM with the attention layer. Next is the classification module.

4.4 Classification module

The contextual feature vector, $C$, is fed to a fully connected FFN having a softmax activation function. The classifier is defined as follows:

$\tilde{y} = \operatorname{argmax}_{y \in Y} \, p(y \mid C)$   (20)

$p(y \mid C) = \operatorname{softmax}(W \cdot C + b)$   (21)

where $W$ is the weight matrix, $b$ is the bias vector, $Y$ is the set of classes, and $\tilde{y}$ is the predicted label for the feature vector $C$, i.e. informative or not_informative. The whole model is trained end-to-end with a supervised learning procedure. A sketch assembling the modules of Sects. 4.2–4.4 is given below.
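The sketch below wires the modules of Sects. 4.2–4.4 together end-to-end. The 768-d inputs stand in for the fine-tuned RoBERTa and ViT outputs, and u = 64, the layer sizes and the two-way softmax head (matching the informative/not_informative labels) are assumptions rather than the authors' exact configuration.

```python
# A self-contained wiring sketch of the pipeline of Sects. 4.2-4.4; the
# encoders are replaced by plain 768-d inputs, sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

u = 64
text_in = tf.keras.Input(shape=(768,), name="roberta_vector")
image_in = tf.keras.Input(shape=(768,), name="vit_cls_vector")

T_p = layers.Dense(u, activation="tanh")(text_in)       # Eq. 14
I_p = layers.Dense(u, activation="tanh")(image_in)      # Eq. 15
TI = layers.Lambda(                                     # Eq. 16: outer product
    lambda ts: tf.einsum("bi,bj->bij", ts[0], ts[1]))([T_p, I_p])

M = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(TI)
a = layers.Dense(1, activation="tanh")(M)               # Eq. 17
alpha = layers.Softmax(axis=1)(a)                       # Eq. 18
C = layers.Lambda(                                      # Eq. 19: context vector
    lambda ts: tf.reduce_sum(ts[0] * ts[1], axis=1))([alpha, M])

x = layers.Dense(1024, activation="relu")(C)            # FFN head (1024 -> 512)
x = layers.Dropout(0.2)(x)
x = layers.Dense(512, activation="relu")(x)
out = layers.Dense(2, activation="softmax")(x)          # Eq. 21

model = tf.keras.Model([text_in, image_in], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```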

5 Experimental setup

The execution setup and environment used in our experiments are described in this section.

The results obtained and the performance of the proposed system are discussed in the next section.

To verify the feasibility of the proposed system for practical deployment, we performed the experiments in two scenarios:

• In-domain classification: training and testing are done on fragments of the same dataset. We have done a comparative analysis of in-domain text-only, image-only and multimodal analysis. Tables 2, 3, 4, 5, 6, 7, 8, 9 and 10 show the results obtained.
• Cross-domain classification: in in-domain classification, the training and testing datasets have the same characteristics and statistical distribution, but this is not possible in real-world situations. Any catastrophic event's early stages are marked by a dearth of labelled data about the event itself. Using a similar previous event to train the classifier is a practical, reasonable solution to this problem. We performed cross-domain experiments with our proposed model, and the system performed well. Table 11 shows the cross-domain results.

Fig. 7 BiLSTM with attention module

5.1 Dataset

Since our aim is multimodal data analysis, we focused on multimodal datasets. We used the CrisisMMD dataset published in [2]. This dataset was used extensively in several state-of-the-art models [12, 20, 25–27]. CrisisMMD is a bimodal dataset consisting of text and associated images scraped from Twitter for seven disasters, including wildfires, earthquakes and floods, that occurred across different parts of the world in 2017. So, the dataset has different linguistic and semantic characteristics. The tweets have one of two labels: informative or non-informative.

We collected the dataset and performed data cleaning procedures and data augmentation techniques to make the dataset meaningful and balanced. Our model was trained and validated on all these seven datasets. Table 1 shows the distribution of data in the dataset.

5.2 Hyperparameters

We used an NVIDIA GeForce RTX 3060 with 12 GB dedicated RAM and 16 GB shared RAM. The models are built with the TensorFlow platform using the Python programming language. We used the fine-tuned RoBERTa model and the fine-tuned ViT model. The choice of text and image feature extractors is based on our experimental results: RoBERTa and ViT showed better performance. The details are shown in Tables 2 and 3.

The sequence length used for text is 60. The image size is 224x224x3. The optimizer used is ADAM with a learning rate of 1e-5 and a decay of 1e-6. We used a stratified train-test split of 80:20 and 5-fold cross-validation. The classification layer is a feed forward neural network having two levels of 1024 and 512 neurons and a final layer of 3 neurons. The activation functions used in the classification layers are relu and softmax. The batch size used is 8. All the details of the proposed system are given in Sect. 4. To beat the overfitting problem, we used dropout regularization of 0.2 and an early stopping mechanism on validation accuracy. A sketch of this training configuration follows.
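In the sketch below, model and the feature arrays are placeholders; the 1e-6 decay is noted but omitted for portability across Keras versions, and a StratifiedKFold loop would wrap the function for the reported 5-fold cross-validation.

```python
# A sketch of the training configuration of Sect. 5.2 (Adam at 1e-5,
# batch size 8, early stopping on validation accuracy, stratified 80:20
# split); `model`, `X_text`, `X_image` and `y` are NumPy placeholders.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

def train(model, X_text, X_image, y, epochs=50):
    idx = np.arange(len(y))
    tr, va = train_test_split(idx, test_size=0.20, stratify=y, random_state=42)
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=3, restore_best_weights=True)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model.fit([X_text[tr], X_image[tr]], y[tr],
                     validation_data=([X_text[va], X_image[va]], y[va]),
                     batch_size=8, epochs=epochs, callbacks=[early_stop])
```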

Table 1 Distribution of data in different disasters

| Sl. no. | Disaster name | Informative (text) | Informative (images) | Non-informative (text) | Non-informative (images) | Period |
|---|---|---|---|---|---|---|
| 1 | Hurricane Irma | 3544 | 2208 | 960 | 2296 | 2017 Sep 6 to Sep 9 |
| 2 | Hurricane Harvey | 3332 | 2457 | 1102 | 1977 | 2017 Sep 6 to Sep 9 |
| 3 | Hurricane Maria | 2842 | 2231 | 1714 | 2325 | 2017 Sep 6 to Sep 9 |
| 4 | California Wildfires | 1253 | 985 | 336 | 604 | 2017 Sep 6 to Sep 9 |
| 5 | Mexico Earthquake | 1031 | 841 | 349 | 539 | 2017 Sep 6 to Sep 9 |
| 6 | Iraq-Iran Earthquake | 493 | 400 | 104 | 197 | 2017 Sep 6 to Sep 9 |
| 7 | Srilanka Floods | 367 | 252 | 655 | 770 | 2017 Sep 6 to Sep 9 |

Table 2 Results of various text classification models

| Sl. no. | Model | Accuracy | Precision | Recall | F1-measure | ROC-AUC score |
|---|---|---|---|---|---|---|
| 1 | RF with TF-IDF | 89.67 | 90.0 | 90.0 | 90.0 | .9618 |
| 2 | SVM with TF-IDF | 89.32 | 89.0 | 89.0 | 89.0 | .9689 |
| 3 | Gaussian NB | 75.0 | 80.0 | 75.0 | 74.0 | .7514 |
| 4 | Stacking Ensemble with TF-IDF | 90.16 | 90.0 | 90.0 | 90.0 | .9652 |
| 5 | Majority Voting Ensemble | 90.67 | 91.0 | 91.0 | 91.0 | .9675 |
| 6 | GLOVE with FFN | 88.07 | 88.0 | 88.0 | 88.0 | .9540 |
| 7 | Word2Vec with FFN | 88.75 | 89.0 | 89.0 | 89.0 | .9594 |
| 8 | BERT with FFN | 90.48 | 91.0 | 90.0 | 90.0 | .9639 |
| 9 | RoBERTa with FFN | 92.10 | 92.0 | 91.0 | 91.0 | .9680 |
| 10 | XLNET with FFN | 91.14 | 91.0 | 91.0 | 91.0 | .9619 |
| 11 | Proposed System | 97.27 | 97.0 | 97.0 | 97.0 | .9938 |

Table 3 Results of various image classification models

| Sl. no. | Model | Accuracy | Precision | Recall | F1-measure | ROC-AUC score |
|---|---|---|---|---|---|---|
| 1 | VGG16 | 77.47 | 77.0 | 77.0 | 77.0 | .8550 |
| 2 | VGG19 | 73.96 | 74.0 | 74.0 | 74.0 | .8202 |
| 3 | InceptionV3 | 80.51 | 80.0 | 80.0 | 80.0 | .8833 |
| 4 | ResNet50 | 80.56 | 81.0 | 81.0 | 81.0 | .8884 |
| 5 | DenseNet121 | 81.59 | 82.0 | 82.0 | 82.0 | .8957 |
| 6 | EfficientNetV2 | 77.10 | 77.0 | 77.0 | 77.0 | .8559 |
| 7 | ViT | 82.20 | 82.0 | 82.0 | 82.0 | .9003 |
| 8 | Proposed system | 97.27 | 97.0 | 97.0 | 97.0 | .9938 |

5.3 Baseline models

To exemplify the effectiveness of our system, we compared it with certain baseline models for text classification, image classification and multimodal classification.

5.3.1 Text-only classification

We built tweet classification models for the tweet text only.

• Gaussian Naive Bayes [38]: Gaussian Naive Bayes is a variant of the Naive Bayes classifier. It is a simple algorithm but has good predictive power, and it is commonly applied for text classification. We implemented a Naive Bayes classifier for tweet text classification with TF-IDF vectorization.
• SVM [25]: We implemented an SVM (Support Vector Machine) classifier for tweet text classification with TF-IDF vectorization. SVM has also proved successful in several text classification tasks. The algorithm tries to learn the hyperplane that clearly separates the classes with a good kernel function, but choosing the right kernel is not an easy task.
• Random Forest [25]: We built an RF classifier for text classification along with TF-IDF vectorization. RF is a highly successful machine learning algorithm. It is based on a collection of decision trees, known as a 'forest', learned using the 'bagging' method. One peculiarity of RF is that it can measure the relative importance of each feature for prediction. By randomly selecting data samples and features, multiple decision trees are built, and either the average or the majority vote of the results is taken.
• Stacking Ensemble [24, 29]: Multiple heterogeneous base learners are simultaneously trained. The meta learner is trained using the base learners' predictions as features. An ensemble learner can outperform any single base model. We implemented a stacking ensemble with decision tree, random forest, KNN and XGBoost as base learners and logistic regression as the meta learner.
• Majority Voting Ensemble [28]: Multiple base models are built, and their predictions are combined. Finally, the system predicts the class with the most votes. Ideally, the system will be better than any single model used. We implemented a majority voting classifier with

AdaBoost, XGBoost, Random Forest and SVM as the contributing models.
• Deep learning models: We implemented some deep learning models with the following techniques as word embedding methods:
– GLOVE [23]
– Word2Vec [45]
– BERT [23, 26]
– RoBERTa
– XLNet

5.4 Image-only analysis

We built tweet classification models based on image classification using the following pretrained image models:

• VGG16 [1, 3, 20, 27, 43]
• VGG19 [12]
• InceptionV3 [32, 46]
• ResNet50 [12, 16]
• DenseNet [18, 26]
• EfficientNetV2 [47]
• ViT [9]

5.5 Multimodal analysis

We experimented with various multimodal fusion techniques and compared the proposed approach with some existing systems. We carried out both in-domain and cross-domain classifications.

• Additive fusion: Different inputs are fused by an addition operation. This is suitable for applications that are not strongly affected by the joint values of the inputs.
• Concatenative fusion: Multiple inputs are fused by concatenation. The argument in support of concatenation is that the inputs are not modified at all, or at least only to a limited extent, so the naturality of the inputs can be preserved. But it cannot capture the interaction between multiple modalities.
• Averaging: The inputs are averaged to get the fused vector.
• Multiplicative fusion: Multiple inputs are fused by a multiplication operation. Multiplicative fusion is good at learning the interaction between multiple modalities.
• Sreenivasulu et al. (2021) [26]: The authors implemented a multimodal tweet classification model with fine-tuned BERT and DenseNet. They used a late fusion strategy with an averaging operation.
• S. Madichetty et al. (2020) [27] (multimodal additive fusion with VGG16 and CNN): A multimodal tweet classification model with a CNN and fine-tuned VGG16 is implemented. They used a late fusion strategy with an additive operation.
• Abhinav Kumar (2020) [20]: A multimodal informative tweet classification is done with LSTM and fine-tuned VGG16 with a concatenative feature fusion.
• Gautam et al. (2019) (mean probability) [12]: ResNet50 for image classification and BiLSTM, CNN+GLOVE for text classification are used. The late fusion strategy is done by averaging the class probabilities.
• Gautam et al. (2019) (custom decision policy) [12]: ResNet50 for image classification and BiLSTM, CNN+GLOVE for text classification are used. The class probabilities are averaged and given to a classification system with ReLU and softmax functions.
• Gautam et al. (2019) (logistic regression decision policy) [12]: Image feature extraction is done with ResNet50 and text feature extraction with BiLSTM, CNN with GLOVE. Later, the features are concatenated and used for classification.

6 Evaluation metrics

Accuracy, macro-average precision, macro-average recall, macro-average F1-score and the ROC-AUC score are used as the evaluation metrics for our classification model:

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$   (22)

$\text{Precision} = \frac{TP}{TP + FP}$   (23)

$\text{Recall} = \frac{TP}{TP + FN}$   (24)

$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$   (25)

where TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.

We have used macro-averaged precision, recall and F1-score to give equal importance to all the classes. The macro average is the arithmetic mean of the distinct classes' respective values, so we get an average measurement per class:

$\text{Macro-averaged Precision} = \frac{\sum_{i=1}^{n} \text{Precision}_i}{n}$   (26)

$\text{Macro-averaged Recall} = \frac{\sum_{i=1}^{n} \text{Recall}_i}{n}$   (27)

$\text{Macro-averaged F1-Score} = \frac{\sum_{i=1}^{n} \text{F1-Score}_i}{n}$   (28)

where $n$ is the number of classes.

ROC-AUC score: the Area Under Curve is the metric indicating the capability of a classifier to discriminate between classes. It is the summary of the Receiver Operating Characteristic curve. The value can be between 0 and 1: a value of 0 means that the model's prediction is 100% incorrect, while a value of 1 indicates that the model's prediction is 100% correct. A model with an AUC score between 0.9 and 1 is considered superior. A short scikit-learn sketch of these metrics is given below.
Table 4 Results of multimodal fusion strategies on the California Wildfires dataset

| Model no. | Acc. | Precision | Recall | F1-score | ROC-AUC score |
|---|---|---|---|---|---|
| M1 | 82.0 | 70.0 | 74.0 | 72.0 | – |
| M2 | 65.0 | – | – | 64.0 | – |
| M3 | – | 74.0 | 73.0 | 73.0 | – |
| M4 | 79.2 | – | – | – | – |
| M5 | 75.0 | – | – | – | – |
| M6 | 75.3 | – | – | – | – |
| M7 | 95.26 | 95.0 | 95.0 | 95.0 | .9905 |
| M8 | 95.26 | 95.0 | 95.0 | 95.0 | .9877 |
| M9 | 97.53 | 98.0 | 98.0 | 98.0 | .9930 |
| M10 | 97.13 | 97.0 | 97.0 | 97.0 | .9942 |

7 Results and discussion

We tested our system with the benchmark multimodal dataset, CrisisMMD. We have done extensive experiments for text-only analysis, image-only analysis and multimodal analysis. We combined the datasets for the seven disasters in CrisisMMD and utilised them as a single dataset for text-only and image-only classification. The experimental results for in-domain classification are displayed in Tables 2, 3, 4, 5, 6, 7, 8, 9 and 10; bold values in the tables indicate the best performance metric values obtained in our experiments. Results for cross-domain classification are shown in Table 11. The results revealed that the proposed system surpasses existing systems in terms of the aforementioned evaluation metrics by a significant margin.

Table 5 Results of multimodal fusion strategies on the Hurricane Harvey dataset

| Model no. | Acc. | Precision | Recall | F1-score | ROC-AUC score |
|---|---|---|---|---|---|
| M1 | 84.0 | 68.0 | 79.0 | 71.0 | – |
| M2 | 77.70 | – | – | 77.60 | – |
| M3 | – | 83.0 | 82.0 | 82.0 | – |
| M4 | 76.0 | – | – | – | – |
| M5 | 73.5 | – | – | – | – |
| M6 | 79.2 | – | – | – | – |
| M7 | 93.91 | 94.0 | 94.0 | 94.0 | .9916 |
| M8 | 96.01 | 96.0 | 96.0 | 96.0 | .9923 |
| M9 | 94.61 | 95.0 | 95.0 | 95.0 | .9904 |
| M10 | 97.13 | 97.0 | 97.0 | 97.0 | .9942 |

7.1 Text-only classification

We implemented text classification using Naive Bayes, SVM, Random Forest, Stacking Ensemble, Majority Voting Ensemble and ANN algorithms. We assessed the strength of these algorithms with different text vectorization methods, including TF-IDF, GLOVE, Word2Vec, BERT, XLNet and RoBERTa. Data augmentation was done to get a balanced class distribution in the dataset. The results are shown in Table 2, and the ROC-AUC curves of the text classification models are shown in Fig. 8. Ensemble methods and dynamic word embedding techniques give good results. Individually, the RoBERTa-based FFN system produces better results. RoBERTa has also been proved to be more robust than other word-embedding systems in several investigations [5]. RoBERTa is a transformer-based pretrained model for text processing with the capability of bidirectional textual processing. It is able to generate deep contextualized dynamic word embeddings because of its rigorous and improved training procedures: it is trained with dynamic masks and tremendous amounts of training data. We utilized RoBERTa embeddings in the proposed approach. The proposed approach has a gain of 5% in accuracy and 6% in F1-score over the RoBERTa-only model. This exemplifies the importance of multimodal systems in data analysis.

Table 6 Results of multimodal fusion strategies on the Hurricane Irma dataset

| Model no. | Acc. | Precision | Recall | F1-score | ROC-AUC score |
|---|---|---|---|---|---|
| M1 | 81.0 | 68.0 | 75.0 | 71.0 | – |
| M2 | 73.82 | – | – | 73.55 | – |
| M3 | – | 81.0 | 81.0 | 81.0 | – |
| M4 | 73.39 | – | – | – | – |
| M5 | 71.4 | – | – | – | – |
| M6 | 80.2 | – | – | – | – |
| M7 | 95.99 | 96.0 | 96.0 | 96.0 | .9928 |
| M8 | 95.92 | 96.0 | 96.0 | 96.0 | .9929 |
| M9 | 93.44 | 94.0 | 93.0 | 93.0 | .9772 |
| M10 | 94.72 | 95.0 | 95.0 | 95.0 | .9886 |
Table 7 Results of multimodal fusion strategies on the Hurricane Maria dataset

| Model no. | Acc. | Precision | Recall | F1-score | ROC-AUC score |
|---|---|---|---|---|---|
| M1 | 80.0 | 77.0 | 80.0 | 78.0 | – |
| M2 | 72.96 | – | – | 72.84 | – |
| M3 | – | 83.0 | 83.0 | 83.0 | – |
| M4 | 74.79 | – | – | – | – |
| M5 | 57.7 | – | – | – | – |
| M6 | 79.4 | – | – | – | – |
| M7 | 92.42 | 92.0 | 92.0 | 92.0 | .9746 |
| M8 | 92.50 | 93.0 | 92.0 | 92.0 | .9801 |
| M9 | 92.11 | 92.0 | 92.0 | 92.0 | .9767 |
| M10 | 98.24 | 98.0 | 98.0 | 98.0 | .9962 |

Table 8 Results of multimodal fusion strategies on the Iraq-Iran Earthquake dataset

| Model no. | Acc. | Precision | Recall | F1-score | ROC-AUC score |
|---|---|---|---|---|---|
| M1 | 83.0 | 74.0 | 63.0 | 66.0 | – |
| M2 | 68.18 | – | – | 67.0 | – |
| M3 | – | 79.0 | 79.0 | 79.0 | – |
| M4 | 73.5 | – | – | – | – |
| M5 | 80.2 | – | – | – | – |
| M6 | 75.2 | – | – | – | – |
| M7 | 93.94 | 94.0 | 94.0 | 94.0 | .9889 |
| M8 | 96.46 | 97.0 | 96.0 | 96.0 | .9720 |
| M9 | 94.95 | 95.0 | 95.0 | 95.0 | .9917 |
| M10 | 97.98 | 98.0 | 98.0 | 98.0 | .9833 |

Table 9 Results of multimodal fusion strategies on the Mexico Earthquake dataset

| Model no. | Acc. | Precision | Recall | F1-score | ROC-AUC score |
|---|---|---|---|---|---|
| M1 | 83.0 | 76.0 | 81.0 | 78.0 | – |
| M2 | 74.29 | – | – | 74.25 | – |
| M3 | – | 73.0 | 72.0 | 72.0 | – |
| M4 | 77.3 | – | – | – | – |
| M5 | 74.6 | – | – | – | – |
| M6 | 77.9 | – | – | – | – |
| M7 | 92.76 | 93.0 | 93.0 | 93.0 | .9772 |
| M8 | 92.52 | 93.0 | 93.0 | 93.0 | .9768 |
| M9 | 93.22 | 93.0 | 93.0 | 93.0 | .979 |
| M10 | 97.90 | 98.0 | 98.0 | 98.0 | .9930 |

Table 10 Results of multimodal fusion strategies on the Srilanka Floods dataset

| Model no. | Acc. | Precision | Recall | F1-score | ROC-AUC score |
|---|---|---|---|---|---|
| M1 | 89.0 | 91.0 | 86.0 | 88.0 | – |
| M2 | 91.78 | – | – | 92.00 | – |
| M3 | – | 94.0 | 94.0 | 94.0 | – |
| M4 | 72.6 | – | – | – | – |
| M5 | 70.8 | – | – | – | – |
| M6 | 74.14 | – | – | – | – |
| M7 | 96.84 | 97.0 | 97.0 | 97.0 | .9924 |
| M8 | 94.47 | 94.0 | 94.0 | 94.0 | .9892 |
| M9 | 97.23 | 97.0 | 97.0 | 97.0 | .9971 |
| M10 | 97.23 | 97.0 | 97.0 | 97.0 | .9928 |

Table 11 F1 scores of cross-domain classification with the proposed system (rows: train set; columns: test set); D1: Hurricane Irma, D2: Hurricane Maria, D3: Hurricane Harvey, D4: California Wildfire, D5: Mexico Earthquake, D6: Sri Lanka Floods, D7: Iraq-Iran Earthquake

| Train set | D1 | D2 | D3 | D4 | D5 | D6 | D7 |
|---|---|---|---|---|---|---|---|
| D1 | 95.0 | 84.0 | 88.0 | 81.0 | 78.0 | 90.0 | 82.0 |
| D2 | 73.0 | 98.0 | 87.0 | 80.0 | 76.0 | 85.0 | 79.0 |
| D3 | 76.0 | 84.0 | 97.0 | 82.0 | 76.0 | 90.0 | 81.0 |
| D4 | 74.0 | 76.0 | 81.0 | 97.0 | 79.0 | 85.0 | 78.0 |
| D5 | 74.0 | 72.0 | 74.0 | 72.0 | 98.0 | 70.0 | 79.0 |
| D6 | 77.0 | 78.0 | 81.0 | 73.0 | 80.0 | 97.0 | 82.0 |
| D7 | 70.0 | 78.0 | 80.0 | 78.0 | 75.0 | 69.0 | 98.0 |

7.2 Image-only classification

We implemented tweet classification with the pretrained image classification models, including VGG16, VGG19, InceptionV3, ResNet50, EfficientNet, DenseNet and a transformer-based vision model, ViT. Each of the models differs in computational complexity. Table 3 shows the results of the various image classification models, and the ROC-AUC curves are shown in Fig. 9. The authors of ViT claimed that ViT achieved 88.55% accuracy on the ImageNet-1k dataset. We implemented a ViT base-32 based image classification model.

In our experiments, ViT shows better results. It achieved an accuracy gain in the range 0–8% and an F1-score gain in the range 0–8% over the other models. The strength of ViT is that it is a transformer-based model: the embedded image patches are output to a transformer encoder. ViT enjoys all the benefits of transformers. It has some residual connections to gain long-range dependency. The

Fig. 8 ROC AUC curves for text only classification

7.3 Multimodal classification

If either the text or the image in a tweet is informative, the tweet is deemed informative. This labeling assumption avoids information loss.
In addition to the proposed method, we implemented additive, concatenative and averaged fusion strategies together with RoBERTa and ViT; the proposed method uses multiplicative fusion. Our experiments included in-domain and cross-domain scenarios.

• In-domain classification We performed a comparative analysis of various fusion methods and existing systems. Details are given in Sect. 5.5.

Tables 4, 5, 6, 7, 8, 9 and 10 and Fig. 11 show the performance metrics of the different multimodal systems tested on seven datasets: Hurricane Harvey, Hurricane Irma, Hurricane Maria, the Iraq-Iran Earthquake, the Mexico Earthquake, the Sri Lanka Floods and the California Wildfires. Figure 10 shows the ROC-AUC curves for the different fusion strategies tested on the Hurricane Maria dataset.


Fig. 9 ROC AUC curves for image only classification

The naming convention for the models used in the tables and figures is as follows:

– M1: Sreenivasulu et al. [26] (multimodal additive fusion with fine-tuned BERT and DenseNet)
– M2: Madichetty et al. [27] (multimodal additive fusion with VGG16 and CNN)
– M3: Kumar et al. [20]
– M4: Gautam et al. [12] (mean probability decision policy)
– M5: Gautam et al. [12] (custom decision policy)
– M6: Gautam et al. [12] (logistic regression decision policy)
– M7: Additive fusion with RoBERTa and ViT
– M8: Concatenative fusion with RoBERTa and ViT
– M9: Averaged fusion with RoBERTa and ViT
– M10: Proposed system (multiplicative fusion with RoBERTa and ViT)

In all the experiments, the proposed approach shows consistent performance. The four fusion operations behind M7–M10 are sketched below.
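As a minimal illustration of the four strategies, assuming both modalities have already been projected to a common feature dimension (PyTorch is our choice here; the paper does not name its framework):

    import torch

    text_feat = torch.randn(8, 256)    # text features, e.g. from RoBERTa (projected)
    image_feat = torch.randn(8, 256)   # image features, e.g. from ViT (projected)

    additive = text_feat + image_feat                          # M7
    concatenative = torch.cat([text_feat, image_feat], dim=1)  # M8
    averaged = (text_feat + image_feat) / 2                    # M9
    multiplicative = text_feat * image_feat                    # M10, element-wise product

Only the multiplicative variant forms a product between every pair of aligned feature elements, which is what lets it surface cross-modal interactions that addition, averaging and concatenation leave implicit.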
The proposed approach is a multimodal data fusion system for tweet classification that takes the tweet text and the associated image as input. Text and image act as complements to each other, and each input modality is processed in the context of the other, yielding more contextual information. We used RoBERTa for text feature extraction and ViT for visual feature extraction.

Fig. 10 ROC AUC curves for different fusion approaches on Hurricane Maria dataset

We tested additive, concatenative, averaged and multiplicative fusion strategies. In concatenation-based fusion, no interaction between the different input modalities is explored. In additive and averaged fusion, the joint value of the features cannot be completely explored. In multiplicative fusion, all the possible hidden associations between the vectors of the different modalities may be unearthed; it can effectively capture the interaction between the input modalities and identify matching elements in a better way.
We applied an early fusion strategy that performs multiplication between each pair of elements; in early fusion, the fusion takes place feature-wise. Together with this fusion method, we added a BiLSTM and an attention mechanism. The BiLSTM helps to analyze context in both directions, and with the help of the BiLSTM and attention layers we obtain highly contextualized, fine-grained features prioritized with respect to the downstream task. We were able to achieve greater performance with this architecture; a sketch of one plausible realization follows.
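The sketch below is our assumption of one plausible realization of this head; the layer sizes, the common sequence length for both modalities and the use of PyTorch are illustrative, since the paper does not fully specify them.

    import torch
    import torch.nn as nn

    class MultiplicativeFusionClassifier(nn.Module):
        def __init__(self, feat_dim=768, hidden=128):
            super().__init__()
            self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                                  bidirectional=True)
            self.attn = nn.Linear(2 * hidden, 1)        # attention score per step
            self.classifier = nn.Linear(2 * hidden, 2)  # informative / not informative

        def forward(self, text_seq, image_seq):
            # text_seq, image_seq: (batch, seq_len, feat_dim); we assume both
            # modalities were projected to the same length and dimension
            fused = text_seq * image_seq                # element-wise early fusion
            out, _ = self.bilstm(fused)                 # (batch, seq_len, 2*hidden)
            weights = torch.softmax(self.attn(out), dim=1)  # normalize over steps
            context = (weights * out).sum(dim=1)        # attention-weighted summary
            return self.classifier(context)

    model = MultiplicativeFusionClassifier()
    logits = model(torch.randn(4, 50, 768), torch.randn(4, 50, 768))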


• Cross-domain classification We carried out cross-domain experiments with the proposed system for every possible pair of the CrisisMMD datasets. As our focus is to devise a multimodal fusion technique rather than a domain adaptation method, we carried out a less extensive comparative analysis in this case. We report the F1-scores in Table 11; the non-diagonal entries display the findings of the cross-domain experiments. Since the F1-score is the harmonic mean of precision and recall, F1 = 2PR/(P + R), it reflects both metrics as well. The proposed system performed well in the cross-domain scenario too, which demonstrates its potential to generalize and offers a reliable framework for cross-domain applications. In light of this, we conclude that our suggested system offers a feasible, real-world solution for tweet classification in crisis management.
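The protocol behind Table 11 reduces to a train-on-one-disaster, test-on-another loop. Below is a toy, self-contained sketch with synthetic features and a logistic-regression stand-in for the real multimodal model; every name and number here is illustrative, not the paper's pipeline.

    from itertools import product
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    # Synthetic (features, labels) per disaster domain, purely for illustration
    domains = {name: (rng.normal(size=(200, 16)), rng.integers(0, 2, 200))
               for name in ["irma", "maria", "harvey", "wildfire",
                            "mexico", "srilanka", "iraq_iran"]}

    f1 = {}
    for src, tgt in product(domains, repeat=2):
        X_train, y_train = domains[src]            # source disaster
        X_test, y_test = domains[tgt]              # target disaster
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        f1[(src, tgt)] = f1_score(y_test, clf.predict(X_test))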

Fig. 11 Performance metric graphs of different multimodal systems on CrisisMMD dataset

8 Conclusion

We presented a neural network architecture for classifying informative content from the user-generated posts available on social media for disaster management systems. The classification system makes use of both text and image data. To achieve superior results, NLP, image processing and deep learning approaches are integrated in a productive way. We implemented multimodal feature-wise fusion in the proposed system: RoBERTa is used for text feature extraction, while ViT is used for image feature extraction. The system was evaluated in in-domain and cross-domain scenarios on seven multimodal disaster datasets covering hurricanes, earthquakes, wildfires and floods. The test results proved that the proposed system is robust, and it attained a good accuracy gain over the baselines.
Our findings are helpful in research disciplines that require diverse inputs from several modalities, such as fake news detection and question-answering systems. The proposed methodology can be applied to data from any social media site, even though we experimented with Twitter data.
Analysis of user-generated content has several open research issues. Code mixing and transliteration are prevalent in social media posts. We intend to examine the impact of code mixing and transliteration on disaster tweet classification systems in the future.

Data availability The datasets used and analyzed during the current study are available online: https://crisisnlp.qcri.org/crisismmd.

Declarations

Competing interests The authors declare that they have no conflict of interest.


References

1. Alam F, Imran M, Ofli F (2017) Image4act: online social media image processing for disaster response. In: Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining, pp 601–604
2. Alam F, Ofli F, Imran M (2018) CrisisMMD: multimodal twitter datasets from natural disasters. In: Twelfth international AAAI conference on web and social media
3. Alam F, Ofli F, Imran M (2018) Processing social media images by combining human and machine computing during crises. Int J Hum Comput Interact 34(4):311–327
4. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473
5. Balakrishnan V, Shi Z, Law CL et al (2021) A deep learning approach in predicting products' sentiment ratings: a comparative analysis. J Supercomput 2021:1–21
6. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
7. Chaudhuri N, Bose I (2019) Application of image analytics for disaster response in smart cities. In: Proceedings of the 52nd Hawaii international conference on system sciences
8. Devlin J, Chang MW, Lee K et al (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
9. Dosovitskiy A, Beyer L, Kolesnikov A et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
10. Endsley MR (1995) Toward a theory of situation awareness in dynamic systems. Hum Factors 37(1):32–64
11. Flin R, O'Connor P, Crichton M (2017) Safety at the sharp end: a guide to non-technical skills. CRC Press, Boca Raton
12. Gautam AK, Misra L, Kumar A et al (2019) Multimodal analysis of disaster tweets. In: 2019 IEEE fifth international conference on multimedia big data (BigMM). IEEE, pp 94–103
13. Ghafarian SH, Yazdi HS (2020) Identifying crisis-related informative tweets using learning on distributions. Inf Process Manag 57(2):102145
14. Gunes H, Piccardi M (2008) Automatic temporal segment detection and affect recognition from face and body display. IEEE Trans Syst Man Cybern Part B (Cybernetics) 39(1):64–84
15. Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
16. He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
17. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
18. Huang G, Liu Z, Van Der Maaten L et al (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708
19. Kejriwal M, Zhou P (2020) On detecting urgency in short crisis messages using minimal supervision and transfer learning. Soc Netw Anal Min 10(1):1–12
20. Kumar A, Singh JP, Dwivedi YK et al (2020) A deep multi-modal neural network for informative twitter content classification during emergencies. Ann Oper Res 2020:1–32
21. Kyrkou C, Theocharides T (2019) Deep-learning-based aerial image classification for emergency response applications using unmanned aerial vehicles. In: CVPR workshops, pp 517–525
22. Liu Y, Ott M, Goyal N et al (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692
23. Madichetty S, Muthukumarasamy S (2020) Detection of situational information from twitter during disaster using deep learning models. Sādhanā 45(1):1–13
24. Madichetty S, Sridevi M (2019) Disaster damage assessment from the tweets using the combination of statistical features and informative words. Soc Netw Anal Min 9(1):1–11
25. Madichetty S, Sridevi M (2021) A novel method for identifying the damage assessment tweets during disaster. Future Gener Comput Syst 116:440–454
26. Madichetty S, Muthukumarasamy S, Jayadev P (2021) Multimodal classification of twitter data during disasters for humanitarian response. J Ambient Intell Hum Comput 12(11):10223–10237
27. Madichetty S et al (2020) Classifying informative and non-informative tweets from the twitter by adapting image features during disaster. Multimed Tools Appl 79(39):28901–28923
28. Madichetty S et al (2020) Identification of medical resource tweets using majority voting-based ensemble during disaster. Soc Netw Anal Min 10(1):1–18
29. Madichetty S et al (2021) A stacked convolutional neural network for detecting the resource tweets during a disaster. Multimed Tools Appl 80(3):3927–3949
30. Martinez-Rojas M, del Carmen Pardo-Ferreira M, Rubio-Romero JC (2018) Twitter as a tool for the management and analysis of emergency situations: a systematic literature review. Int J Inf Manag 43:196–208
31. Mohanty SD, Biggers B, Sayedahmed S et al (2021) A multi-modal approach towards mining social media data during natural disasters—a case study of hurricane Irma. Int J Disaster Risk Reduct 54:102032
32. Mouzannar H, Rizk Y, Awad M (2018) Damage identification in social media posts using multimodal deep learning. In: ISCRAM
33. Nugroho KS, Sukmadewa AY, Wuswilahaken DWH et al (2021) BERT fine-tuning for sentiment analysis on Indonesian mobile apps reviews. In: 6th international conference on sustainable information engineering and technology, pp 258–264
34. Ofli F, Alam F, Imran M (2020) Analysis of social media data using multimodal deep learning for disaster response. arXiv preprint arXiv:2004.11838
35. Pajak K, Pajak D (2022) Multilingual fine-tuning for grammatical error correction. Expert Syst Appl 200:116948
36. Palen L, Anderson KM, Mark G et al (2010) A vision for technology-mediated support for public participation & assistance in mass emergencies & disasters. ACM-BCS Vis Comput Sci 2010:1–12
37. Phengsuwan J, Shah T, Thekkummal NB et al (2021) Use of social media data in disaster management: a survey. Future Internet 13(2):46
38. Pourebrahim N, Sultana S, Edwards J et al (2019) Understanding communication dynamics on twitter during natural disasters: a case study of hurricane Sandy. Int J Disaster Risk Reduct 37:101176
39. Rizk Y, Jomaa HS, Awad M et al (2019) A computationally efficient multi-modal classification approach of disaster-related twitter images. In: Proceedings of the 34th ACM/SIGAPP symposium on applied computing, pp 2050–2059
40. Shah R, Zimmermann R (2017) Multimodal analysis of user-generated multimedia content. Springer, Berlin
41. Shah RR, Yu Y, Zimmermann R (2014) Advisor: personalized video soundtrack recommendation by late fusion with heuristic rankings. In: Proceedings of the 22nd ACM international conference on multimedia, pp 607–616
42. Shah RR, Mahata D, Choudhary V et al (2018) Multimodal semantics and affective computing from multimedia content. In: Intelligent multidimensional data and image processing. IGI Global, pp 359–382
43. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
44. Singh T, Kumari M (2016) Role of text pre-processing in twitter sentiment analysis. Procedia Comput Sci 89:549–554
45. Snyder LS, Lin YS, Karimzadeh M et al (2019) Interactive learning for identifying relevant tweets to support real-time situational awareness. IEEE Trans Vis Comput Graph 26(1):558–568
46. Szegedy C, Liu W, Jia Y et al (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
47. Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning, PMLR, pp 6105–6114
48. Tripathy JK, Chakkaravarthy SS, Satapathy SC et al (2020) ALBERT-based fine-tuning model for cyberbullying analysis. Multimed Syst 2020:1–9
49. Valdez DB, Godmalin RAG (2021) A deep learning approach of recognizing natural disasters on images using convolutional neural network and transfer learning. In: Proceedings of the international conference on artificial intelligence and its applications, pp 1–7
50. Yu Y, Tang S, Aizawa K et al (2018) Category-based deep CCA for fine-grained venue discovery from multimodal data. IEEE Trans Neural Netw Learn Syst 30(4):1250–1258
51. Yu Y, Tang S, Raposo F et al (2019) Deep cross-modal correlation learning for audio and lyrics in music retrieval. ACM Trans Multimed Comput Commun Appl 15(1):1–16
52. Zahra K, Imran M, Ostermann FO (2020) Automatic identification of eyewitness messages on twitter during disasters. Inf Process Manag 57(1):102107

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
