
Hero, Villain and Victim: Dissecting harmful memes for Semantic role labelling of entities

A Major Project Report


Submitted in partial fulfillment of the requirements for the degree of
Bachelor of Technology
in
Computer Science and Engineering

by

Sumith Sai Budde 18BCS101


Shaik Fharook 18BCS091
Syed Sufyan Ahmed 18BCS103
Gurram Rithika 18BCS031

Under the guidance of

Dr. Sunil Saumya


Asst. Professor, IIIT Dharwad

Department of Computer Science and Engineering


Indian Institute of Information Technology
Dharwad_580009 (India)
May 2022
Certificate

Department of Computer Science and Engineering


Indian Institute of Information Technology, Dharwad

It is certified that the work contained in the project report entitled “Hero, Villain
and Victim: Dissecting harmful memes for Semantic role labelling of entities” by the
following students has been carried out under my supervision and that this work has not
been submitted elsewhere for a degree.

Sumith Sai Budde 18BCS101


Shaik Fharook 18BCS091
Syed Sufyan Ahmed 18BCS103
Gurram Rithika 18BCS031

This project report entitled “Hero, Villain and Victim: Dissecting harmful
memes for Semantic role labelling of entities” submitted by Group 21 is approved
for the degree of Bachelor of Technology.

Date: 05 May 2022
Place: Dharwad

Supervisor
Dr. Sunil Saumya
Assistant Professor
Dept. of Computer Science & Engineering
IIIT Dharwad

Head of Department
Dr. Uma Sheshadri
Professor
Dept. of Computer Science & Engineering
IIIT Dharwad
Declaration

05 May 2022
IIIT Dharwad

We certify that this written submission truly describes our ideas in our own words and
that where others’ ideas or words have been used, we have properly cited and referenced
the original sources. We certify that all sources used in the preparation of this report have
been properly and accurately acknowledged. We further declare that we have adhered to
all the principles of academic honesty and integrity and have not misrepresented, faked,
fabricated or falsified any idea, data, fact or source. We acknowledge that any infringement
of the foregoing will result in disciplinary action by the Institute, as well as legal action
from sources that were not correctly referenced or from whom permission was not
obtained when required.

Sumith Sai Budde 18BCS101


Shaik Fharook 18BCS091
Syed Sufyan Ahmed 18BCS103
Gurram Rithika 18BCS031

Acknowledgements

05 May 2022
IIIT Dharwad

Despite the effort we have put into this project, it would not have been feasible without
the kind support and assistance of many individuals and organisations. We would like
to express our heartfelt gratitude to each and every one of them.
We would like to sincerely thank Asst. Prof. Dr. Sunil Saumya and Mr. Shankar
Biradar for their kind cooperation, encouragement, guidance, and constant supervision,
as well as for providing important project information and supporting us in finishing
the project.
We would also like to express our gratitude to our parents and the faculty members of
IIIT Dharwad for their kind cooperation and encouragement, which helped us complete
this project.

Sumith Sai Budde 18BCS101


Shaik Fharook 18BCS091
Syed Sufyan Ahmed 18BCS103
Gurram Rithika 18BCS031

Abstract

The identification of good and evil through representations of heroism, villainy and
victimhood, i.e., role labelling of entities, has recently piqued the scientific community’s
interest. With the massive growth in the number of memes, the amount of objectionable
content is increasing at an astounding rate, creating a stronger need to address this issue
and examine memes for content moderation. Framing can be used to categorize the
entities involved in a meme as heroes, villains, victims or others, so that readers can better
anticipate and understand their behaviours and attitudes as characters. In this report we
discuss the pre-processing techniques used, along with four approaches to role-label the
entities of a meme as hero, villain, victim or other, using techniques such as Named Entity
Recognition (NER), sentiment analysis and image captioning. We took up this project as
part of the Shared Task@Constraint 2022 competition1, organized by Indraprastha Institute
of Information Technology Delhi (IIIT-Delhi) and collocated with ACL 2022. The first two
approaches discussed in this report were accepted for the competition, securing eighth
position with an F1-Score of 23.855 and resulting in a research paper published at ACL
2022 2 .

1 https://lcs2.iiitd.edu.in/CONSTRAINT-2022/
2 Link to the published research paper: https://aclanthology.org/2022.constraint-1.3.pdf

Table of Contents

Acknowledgements iii

Abstract iv

List of Figures vii

List of Tables viii

1 Introduction 1

2 Literature Review 3

3 Task and Dataset Description 5


3.1 About the Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 8

4 Prerequisites 10
4.1 VADER Sentiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.2 Wu-Palmer Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.3 Wu-Palmer Similarity Calculation . . . . . . . . . . . . . . . . . . . . . 10
4.4 Image Captioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.4.1 Architecture of Inception-v3 Model . . . . . . . . . . . . . . . . 12
4.4.2 The Final Inception v3 Model . . . . . . . . . . . . . . . . . . . 15
4.4.3 Architecture of Long Short-Term Memory(LSTM) . . . . . . . . 17
4.4.4 Image Captioning using Inception-v3 and LSTM . . . . . . . . . 19

5 Methodology 21
5.1 Data Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1.1 Entity Sentence Linking . . . . . . . . . . . . . . . . . . . . . . 21
5.1.2 Creation of Role-dictionaries . . . . . . . . . . . . . . . . . . . . 22


5.1.3 Similarity Dictionary . . . . . . . . . . . . . . . . . . . . . . . . 23


5.2 Methods and Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2.1 Framework-I . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2.2 Framework-II . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2.3 Improved Framework-I . . . . . . . . . . . . . . . . . . . . . . . 28
5.2.4 Improved Framework-II . . . . . . . . . . . . . . . . . . . . . . 30

6 Results 32

7 Conclusion and Future Scope 34

References 35
List of Figures

3.1 Examples of entities portrayed as heroes, villains, victims and others within memes . . . . . 6
3.2 Dataset Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Category Wise Entity Distribution . . . . . . . . . . . . . . . . . . . . . 8
3.4 Entity Wordcloud . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.5 Feature Distribution Plots . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4.1 Wu-Palmer similarity score determination . . . . . . . . . . . . . . . . . 11


4.2 Image Captioning example . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.3 Factorization into Smaller Convolutions . . . . . . . . . . . . . . . . . . 13
4.4 Spatial Factorization into Asymmetric Convolutions . . . . . . . . . . . . 14
4.5 Efficient Grid Size Reduction . . . . . . . . . . . . . . . . . . . . . . . . 14
4.6 Components of Inception-v3 . . . . . . . . . . . . . . . . . . . . . . . . 15
4.7 Architecture of Inception-v3 . . . . . . . . . . . . . . . . . . . . . . . . 16
4.8 LSTM cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.9 LSTM hidden layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.10 Image Captioning example and utilization . . . . . . . . . . . . . . . . . 18
4.11 Image Captioning using Inception-v3 and LSTM . . . . . . . . . . . . . 20

5.1 Entity sentence linking example . . . . . . . . . . . . . . . . . . . . . . 22


5.2 Role Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3 Similarity Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.4 Framework-I Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.5 Framework-II Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.6 Improved Framework-I architecture . . . . . . . . . . . . . . . . . . . . 29
5.7 Improved Framework-II architecture . . . . . . . . . . . . . . . . . . . . 31

6.1 Output of Framework-I and Framework-II . . . . . . . . . . . . . . . . . 32

List of Tables

3.1 Dataset Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

6.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


6.2 Top performing teams in Shared Task@Constraint 2022 . . . . . . . . . . 33

Chapter 1

Introduction

Easy access to the internet and technology has attracted today’s youth to social media.
These applications offer a large platform for users to communicate with others and share
their thoughts and opinions. With these advantages comes a disadvantage: many people
exploit the platform to spread offensive content on social media under the guise of
freedom of expression [1]. This incendiary material is usually directed towards a single
person, a small group of people, a religious group, or a community.
People create offensive content and aggressively spread it over social media [2, 3].
This type of information is created for many purposes, including commercial and political
benefit [4, 5]. Such communication can disturb societal harmony and spark riots. It also
has the potential to have a negative psychological impact on readers, harming people’s
emotions and behavior [6, 7]. As a result, identifying such content is crucial, and
researchers, politicians, and investors are working to build reliable methods for dissecting
the dangerous memes present on the internet.
Framing allows a communication source to portray and describe a problem within a
"field of meaning" by employing conventional narrative patterns and cultural references
[8]. By connecting with readers’ existing knowledge, cultural narratives, and moral
standards, framing helps to construct events [9]. It can portray the characters in a story
as heroes, villains, or victims, making it easier for the audience to anticipate and
comprehend their attitudes, beliefs, decisions, and actions.
Narrative frames can be found in various media, including memes, films, literature,
and the news. Narrators use emotionality to plainly distinguish between good and evil
through vivid descriptions of victimization, heroism, and villainy, which is a major feature
of the popular storytelling culture [10]. Positive adjectives are used to portray heroes,
whereas negative terms depict victims and villains.


In popular culture, heroes represent bravery, great accomplishments, or other noble


attributes, whereas villains represent malicious intents, conspiring, and other undesirable
characteristics [10]. To summarise, narrative frames are essential for understanding new
situations in terms of prior ones and therefore making sense of the causes, events, and
consequences.
The standard method for detecting narrative frames is to examine the semantic
relationships between the various elements in the meme and the events it portrays.
However, understanding the events in a narrative and the roles that the entities in the
meme play in those events is a complex, tough, and computationally expensive task.
Thus, rather than determining all of the specific events and event types described
in the meme, as well as the semantic relationships among the entities involved in those
events in great detail, we propose methodologies in which the entities are analyzed at a
much higher level of abstraction: specifically, whether they hold the qualities of heroes,
victims, villains, or none, as conveyed by the terms used to characterize them. As a
result, we arrive at a rather simple realization: the terms nearest to each entity are
evaluated for their sentiment polarity or for their closeness to terms associated with
heroes, villains, or victims.
Chapter 2

Literature Review

Social media is being adversely affected by the rapid spread of
misinformation ([11], [12], [13]), fake news ([14], [15]), hate speech ([16], [17]), the
COVID-19 infodemic ([18]), propaganda and other harmful content. Recently,
memes have emerged as a potent multi-modal way of disseminating dangerous material
due to their capacity to evade censorship rules and their rapid dissemination. The topic
of entity role detection from narratives has thus recently piqued the interest of several
corporate and academic researchers. However, there have been only a few efforts to
extract knowledge from newspaper articles, and those utilized the article bodies to derive
meaning while focusing on the headline ([19], [20], [10]).
An apparently innocent meme may readily become a source of harmful information
spread with an appropriately constructed combination of graphics and words. As a result,
investigating the negative aspects of memes has become an important academic issue [12].
In general, a meme can be analysed in a variety of contexts, such as hate speech [21][22],
emotions [23], misinformation [24], offensiveness [25] and propaganda [26], but [12]
found that only limited studies have addressed comprehending the role of the entities
that make up a meme. This type of classification of the entities in a meme can assist in
understanding entity-specific meaning as well as their nature, attitudes, decisions, and
demeanour [12].
While prior meme studies attempted to identify harmfulness and the entities [12]
or the categories being targeted, such as a person, a group, an organisation, or society
[27], none of them examined the entity’s connotation, which helps in understanding the
individual entities in the meme. Several studies on online targeting in the context of
harmful disclosure on social media suggest that sarcastic content can be detected by
leveraging data sparseness [28] towards aspect-based sentiment analysis. Different
studies on detecting harmful memes indicate that using various multi-modal frameworks
and large datasets is crucial in the research of hate speech, offense and online harm
detection ([25], [22]). Some studies suggest that additional cues involving commonsense
knowledge [29], semantic entities and protected categories [30], along with other
meta-information, can be explored to characterize the online harm conveyed by memes
at various levels of granularity.
While memes are a mix of image and text data, it is equally vital in role detection
to obtain context from both modalities. (Kun et al., 2022) [31] attempted to get the
context of the image by making use of celebrity face detection using Giphy’s GitHub1,
followed by a sub-image detector using YOLOv52. Their work, based on an ensemble of
models such as DeBERTa [32], RoBERTa [33], ViLT [34] and EfficientNetB7 [35], shows
that memes can be studied from both the image and text context. (Singh et al., 2022) [36]
gave a new way of approaching the text data by formulating the problem as a Multiple
Choice Question Answering (MCQA) task. (Zhou et al., 2022) [37] leveraged the Visual
Commonsense Reasoning (VCR) framework along with ensemble models to get greater
context from the image.
Although many of the approaches are theoretical, hardly any attempts have been made
to role-label the entities that are exalted, demonized, or victimized ([38]). Instead, studies
were conducted to see how satire delivered through internet memes affects brand image
([39]). We tried a different approach based on sentiment and lexicons to associate
sentiment polarity for role labelling.

1
https://github.com/Giphy/celeb-detection-oss
2
https://github.com/ultralytics/yolov5
Chapter 3

Task and Dataset Description

3.1 About the Task


For this task, we chose the problem from Shared Task@Constraint 20221,
organized by Indraprastha Institute of Information Technology Delhi (IIIT-Delhi) and
hosted on Codalab2. According to the competition, the problem description is as follows:
given a meme and an entity, the task is to determine the role of the entity in the meme as
hero, villain, victim or other. In general, a meme can be analysed in a variety of contexts,
such as hate speech [21][22], emotions [23], misinformation [24], offensiveness [25] and
propaganda [26]. The key constraint in this problem is that the meme is to be analysed
from the perspective of the author of the meme [40]; in other words, the entities within
the meme are role-labelled from the author’s perspective.
The label assigned to each entity is determined by how the entity is portrayed in the
meme:

• Hero: An entity that is glorified for its actions.

• Villain: An entity that is portrayed negatively in relation to undesirable attributes
such as cruelty, hypocrisy, and so on.

• Victim: An entity that is victimized by the negative impact of someone else’s
actions.

• Other: An entity that is neither a hero, a villain nor a victim.

1
https://constraint-lcs2.github.io/
2
https://codalab.lisn.upsaclay.fr/competitions/906


3.2 Dataset Description


For this problem, we use the novel dataset HVVMemes [40] provided by the authors
of the competition. This dataset is leveraged and re-annotated from the HarMeme dataset
released at Findings of EMNLP 2021 [30]. It is a curation of 6933 memes spanning the
domains of COVID-19 and US Politics, in which each meme is annotated with a list of
the entities involved in it. Figure 3.1 shows examples of entities from different memes
categorized into heroes, villains, victims and others. Figure 3.1a glorifies Chuck Norris
as a hero and shows the Corona Virus as both villain and victim. Figure 3.1b frames how
the public is thinking about the COVID-19 situation. Figure 3.1c compares Sonu Sood
with food delivery companies regarding service delivery, praising Sonu Sood as a hero
and labelling Zomato and Swiggy as others.


Figure 3.1: Examples of entities portrayed as heroes, villains, victims and others within
memes.

Each item of the train and validation datasets contains an image representing the
meme and its metadata, which contains the pre-extracted OCR text along with its entities
mapped to the Hero, Villain, Victim and Other categories. A sample from the train and
validation datasets can be seen in Figure 3.2a. Similarly, each item of the test dataset
contains an image representing the meme and its metadata, i.e., the pre-extracted OCR
text along with its entities. A sample from the test dataset can be seen in Figure 3.2b.

(a) Train/Validation Dataset Sample (b) Test Dataset Sample

Figure 3.2: Dataset Sample

The dataset contains a total of 6933 memes organized into three parts: train,
validation and test set. It is well balanced domain-wise, with 3381 COVID-19 and 3552
US Politics memes. It is fairly split into 5552 train samples, 650 validation samples and
731 test samples. Table 3.1 gives a detailed domain-wise distribution of the dataset.

Splits        COVID-19   US Politics   Total
Train             2700          2852    5552
Validation         300           350     650
Test               381           350     731
Total             3381          3552    6933

Table 3.1: Dataset Distribution

3.3 Exploratory Data Analysis


We performed a detailed Exploratory Data Analysis (EDA) on the metadata, i.e., the
OCR text and the entities of the dataset, considering features such as the sentiment
polarity score of the OCR text, individual and total entities, OCR length, OCR word
count, average OCR word length, parts of speech, and the unigram, bigram and trigram
distributions of the OCR text. Figure 3.5 shows the distribution plots for the different
features.

Figure 3.3: Category Wise Entity Distribution

Figure 3.4: Entity Wordcloud

The category-wise distribution shown in Figure 3.3 shows that 78% of the entities
fall into the other category, while only 3% belong to the hero category. The top-50-words
entity wordcloud in Figure 3.4 gives insights into the most frequent entities across the
dataset. The sentiment polarity score distribution plot shown in Figure 3.5a shows that
most of the OCR text is neutral, with a sentiment polarity ranging between -0.025 and
0.0249. A detailed distribution across the different features can be seen in Figure 3.5.

(a) Sentiment Polarity Score of OCR
(b) Parts of Speech Distribution of OCR
(c) OCR Text Length Distribution
(d) OCR Text Word Count Distribution
(e) Average OCR Word Length Distribution
(f) Top 50 Unigram Distribution of OCR
(g) Top 50 Bigram Distribution of OCR
(h) Top 50 Trigram Distribution of OCR

Figure 3.5: Feature Distribution Plots


Chapter 4

Prerequisites

4.1 VADER Sentiment


The process of ‘computationally’ determining whether a piece of writing is positive,
negative or neutral is called sentiment analysis. It is also known as opinion mining:
deriving the opinion or attitude of a speaker.
VADER1, which stands for Valence Aware Dictionary and sEntiment Reasoner, is
a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments
expressed in social media. In this methodology, every word in the vocabulary is appraised
with respect to whether it is positive or negative, and how positive or negative it is.
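The lexicon-and-rule idea can be illustrated with a minimal sketch. This is not the real VADER implementation (whose lexicon contains thousands of rated words plus rules for punctuation, capitalisation and negation); the tiny word-rating table and the example sentences below are hypothetical, but the normalisation of the raw sum into [-1, 1] mirrors VADER's compound score:

```python
# Minimal lexicon-based sentiment sketch (NOT the real VADER implementation):
# each word carries a valence rating; the sentence score is the normalised sum.
import math

LEXICON = {"great": 3.1, "love": 3.2, "good": 1.9,
           "bad": -2.5, "terrible": -2.1, "hate": -2.7}

def sentiment_score(text: str) -> float:
    """Return a compound-style score in [-1, 1]; 0 means neutral."""
    total = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    # normalise the raw valence sum into [-1, 1], as VADER does with alpha = 15
    return total / math.sqrt(total * total + 15)

print(sentiment_score("what a great day i love it"))       # positive
print(sentiment_score("the weather is terrible i hate it"))  # negative
```

Words absent from the lexicon contribute nothing, so a sentence with no rated words scores exactly 0 (neutral), matching the neutral band observed in the EDA.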

4.2 Wu-Palmer Similarity


It is a metric defined over a set of documents or terms (words), where the idea of
distance between items is based on the likeness of their meaning or semantic content,
as opposed to lexicographical similarity.
Wu-Palmer2 similarity calculates relatedness by considering the depths of the
two synsets [41] in the WordNet [42] taxonomies. It determines a similarity score based
on how similar the word senses are.

4.3 Wu-Palmer Similarity Calculation


Wu-Palmer similarity calculates relatedness by considering the depths of the two
synsets in the WordNet taxonomies, along with the depth of the LCS (Least Common
Subsumer), as shown in Figure 4.1.
1
https://pypi.org/project/vaderSentiment/
2
https://arxiv.org/ftp/arxiv/papers/1310/1310.8059.pdf


Figure 4.1: Wu-Palmer similarity score determination

The similarity score lies in the interval (0, 1], i.e., 0 < score ≤ 1. The score can
never be zero, because the depth of the LCS (Least Common Subsumer) is never zero
(the depth of the root of the taxonomy is one).
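The formula can be demonstrated on a toy taxonomy, a hypothetical miniature of WordNet (the node names below are made up for illustration): wup(a, b) = 2 · depth(LCS) / (depth(a) + depth(b)), with the root at depth 1.

```python
# Wu-Palmer similarity over a toy taxonomy (a hypothetical miniature of WordNet).
# The root has depth 1, so the score always lies in (0, 1].

PARENT = {                      # child -> parent; "entity" is the root
    "entity": None,
    "person": "entity", "animal": "entity",
    "hero": "person", "villain": "person",
    "dog": "animal",
}

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path                  # e.g. ["hero", "person", "entity"]

def depth(node: str) -> int:
    return len(path_to_root(node))   # depth("entity") == 1

def wup_similarity(a: str, b: str) -> float:
    ancestors_a = set(path_to_root(a))
    # LCS = deepest shared ancestor; b's path is ordered deepest-first
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup_similarity("hero", "villain"))  # LCS is "person": 2*2/(3+3)
print(wup_similarity("hero", "dog"))      # LCS is "entity": 2*1/(3+3)
```

Since the LCS is at worst the root (depth 1), the numerator is never zero, which is exactly why the score cannot reach 0.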

4.4 Image Captioning


The process of generating a textual description of an image is known as image
captioning. It makes use of both Natural Language Processing (NLP) and Computer
Vision to generate the captions.
Image captioning requires recognizing the important objects, their attributes, and
their relationships in an image. It also needs to generate syntactically and semantically
correct sentences [43]. Deep-learning-based techniques are capable of handling the
complexities and challenges of image captioning [43], [44], [45], [46], [47], [48].
Generating a caption for a given image is a challenging problem in the deep learning
domain. Different techniques from computer vision and NLP are utilized to recognize the
context of an image and describe it in a natural language such as English, as shown in
Figure 4.2.

Figure 4.2: Image Captioning example

4.4.1 Architecture of Inception-v3 Model

Inception-v3 is a convolutional neural network (CNN) architecture from the
Inception family that makes several improvements, including label smoothing, factorized
7 x 7 convolutions, and the use of an auxiliary classifier to propagate label information
lower down the network (along with the use of batch normalization for layers in the
side-head) [49], [50].
Inception-v3 is a deep learning model based on convolutional neural networks,
used for image classification, image captioning and various other tasks. It is an improved
version of the basic Inception-v1 model, which was introduced as GoogLeNet in 2014;
as the name suggests, it was developed by a team at Google.
The Inception-v3 model was released in 2015; it has a total of 42 layers and a lower
error rate than its predecessors. The major features of the Inception-v3 model are:

1. Factorization into Smaller Convolutions

2. Spatial Factorization into Asymmetric Convolutions

3. Utility of Auxiliary Classifiers

4. Efficient Grid Size Reduction

The implementation of the above features and optimizations is shown in
Figures 4.3, 4.4 and 4.5.

Figure 4.3: Factorization into Smaller Convolutions

The goal of an auxiliary classifier is to improve the convergence of very deep neural
networks. In very deep networks, the auxiliary classifier is primarily employed to tackle
the vanishing gradient problem.
In the early stages of training, the auxiliary classifiers make no difference. However,
towards the end of training, the network with auxiliary classifiers outperforms the network
without them in terms of accuracy.
As a result, the auxiliary classifiers in the Inception-v3 model architecture act as
regularizers.

Figure 4.4: Spatial Factorization into Asymmetric Convolutions

Figure 4.5: Efficient Grid Size Reduction



4.4.2 The Final Inception v3 Model

In total, the Inception-v3 model is made up of 42 layers, slightly more than the
previous Inception-v1 and v2 models, but its efficiency is impressive. The components
of the Inception-v3 model are shown in Figure 4.6.

Figure 4.6: Components of Inception-v3

After performing all the optimizations, the final Inception-v3 model is as shown in
Figure 4.7.
Figure 4.7: Architecture of Inception-v3

4.4.3 Architecture of Long Short-Term Memory(LSTM)

Long Short-Term Memory (LSTM) is a kind of recurrent neural network (RNN). In
an RNN, the output from the previous step is fed as input to the current step. LSTM was
designed by Hochreiter & Schmidhuber. It tackles the long-term dependency problem of
RNNs, in which an RNN cannot predict words stored in long-term memory but can give
more accurate predictions from recent information; as the gap length increases, an RNN
does not perform efficiently. LSTM can by default retain information for a long period
of time [51], [52], [53]. It is used for processing, predicting, and classifying on the basis
of time-series data.

Figure 4.8: LSTM cell

The basic difference between the architectures of RNNs and LSTMs is that the
hidden layer of an LSTM is a gated unit or gated cell. It consists of four layers that
interact with one another to produce the output of the cell along with the cell state; these
two things are then passed on to the next hidden layer. Unlike RNNs, which have a single
tanh neural-network layer, LSTMs comprise three logistic sigmoid gates and one tanh
layer, as shown in Figure 4.8. Gates were introduced to limit the information passed
through the cell; they determine which part of the information will be needed by the next
cell and which part is to be discarded. The gate output lies in the range 0-1, where ‘0’
means ‘reject all’ and ‘1’ means ‘include all’.
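One time step of the gated cell described above can be sketched in plain Python. This is a scalar, single-unit illustration with hypothetical weights (real LSTM layers are vectorised and learn their weights); it shows how the three sigmoid gates, each bounded in (0, 1), combine with the tanh candidate to update the cell and hidden states:

```python
# One LSTM cell step (scalar sketch; real implementations are vectorised).
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """w maps gate name -> (w_x, w_h, bias); all scalars for readability."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    c = f * c_prev + i * g        # new cell state: keep some old, add some new
    h = o * math.tanh(c)          # new hidden state, filtered by the output gate
    return h, c

weights = {k: (0.5, 0.5, 0.0) for k in ("f", "i", "g", "o")}  # hypothetical weights
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, w=weights)
```

The forget gate `f` scales the previous cell state, so information can be carried across many steps largely unchanged, which is what lets LSTMs retain long-term dependencies.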

Figure 4.9: LSTM hidden layer

Figure 4.10: Image Captioning example and utilization



4.4.4 Image Captioning using Inception-v3 and LSTM

Machines, unlike humans, are unable to comprehend images simply by looking at
them. As a result, we must turn the image into an encoding so that the system can
recognise the patterns. We employ transfer learning for this task: we take a pre-trained
model that has already been trained on huge datasets and extract features from it for use
in our work. We use the Inception-v3 model, which was trained on the ImageNet dataset
with 1000 discrete classes. To retrieve the (2048,)-dimensional feature vector from the
Inception-v3 model, we remove the last classification layer.
We use the encoding of an image together with a start word to predict the next word.
After that, we again use the same image and the predicted word to predict the word that
follows [54], [55], [56], [41]. The image is thus used at every iteration for the entire
caption. This is how we generate the caption for an image, as shown in Figure 4.11.
After generating the image caption, we combine it with the OCR text of the respective
image, as shown in Figure 4.10.
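The iterative decoding loop described above can be sketched as follows. The trained Inception-v3 + LSTM model is replaced here by a hypothetical lookup table (`TOY_MODEL` and the caption words are made up for illustration); the loop structure, feeding the fixed image encoding plus the words generated so far at every step, matches the procedure in the text:

```python
# Greedy caption decoding sketch: the (fixed) image encoding plus the words
# generated so far are fed to the model at every step until "<end>" or max_len.
# `predict_next_word` is a stand-in for the trained Inception-v3 + LSTM model.

TOY_MODEL = {                       # hypothetical next-word table for one image
    ("<start>",): "a",
    ("<start>", "a"): "dog",
    ("<start>", "a", "dog"): "running",
    ("<start>", "a", "dog", "running"): "<end>",
}

def predict_next_word(image_encoding, words):
    return TOY_MODEL.get(tuple(words), "<end>")

def generate_caption(image_encoding, max_len=20):
    words = ["<start>"]
    for _ in range(max_len):
        nxt = predict_next_word(image_encoding, words)  # image used every step
        if nxt == "<end>":
            break
        words.append(nxt)
    return " ".join(words[1:])      # drop the start token

caption = generate_caption(image_encoding=[0.12] * 2048)  # (2048,) feature vector
print(caption)
```

In the real pipeline the greedy argmax over the vocabulary plays the role of the lookup table, and the resulting caption is concatenated with the meme's OCR text.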
Figure 4.11: Image Captioning using Inception-v3 and LSTM
Chapter 5

Methodology

During the course of this work, we proposed two frameworks based on two different
methods. In the first method, we perform entity recognition followed by sentiment
analysis (VADER sentiment). In the second method, entity recognition is followed by
Wu-Palmer similarity [57] to calculate the similarity score of each entity with each of the
roles, i.e., hero, villain, victim and other. Image captioning was then utilized to derive
more context from the memes and to aid both frameworks in the task of role labelling of
entities.

5.1 Data Pre-Processing


The following data processing steps were performed while creating an end-to-end
system: given a meme image and its OCR text, we recognize the entities present in the
meme by performing entity recognition on the text. However, in the competition, since
the entities are already provided as an entity list, the entity recognition step can be
skipped.
Each entity is then linked to its corresponding parts of the sentence (the words
surrounding the entity) in the OCR text of the respective meme. Role dictionaries were
then curated using the OCR text, which in turn were used to create the similarity
dictionary.

5.1.1 Entity Sentence Linking

Each entity in a particular meme is linked to its corresponding parts of the sentence
(the words surrounding the entity) in the OCR text of that meme. We make the fair
assumption that words nearer to an entity weigh more in its role assignment than words
farther from it.
First, we search for the entity’s occurrence in the OCR sentences. Then, using a
window approach (i.e., selecting the n words occurring before the entity and the n words
occurring after it), we create a sub-part of that sentence. By doing this over the whole
OCR text of the respective meme, we create a list of sub-sentences, one for each entity
present in that particular meme, as shown in Figure 5.1.

Figure 5.1: Entity sentence linking example
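The windowing step above can be sketched in a few lines. The OCR sentence, entity names and window size below are illustrative (the real pre-processing also handles punctuation and repeated occurrences of an entity):

```python
# Window-based entity-sentence linking sketch: take the n words before and
# after the entity's occurrence in the OCR text to form a sub-sentence.

def link_entity(ocr_text: str, entity: str, n: int = 3) -> str:
    words = ocr_text.lower().split()
    ent = entity.lower().split()
    for i in range(len(words) - len(ent) + 1):
        if words[i:i + len(ent)] == ent:           # entity occurrence found
            lo = max(0, i - n)                     # n words before the entity
            hi = min(len(words), i + len(ent) + n) # n words after the entity
            return " ".join(words[lo:hi])
    return ""                                      # entity not found in OCR

ocr = "chuck norris already had the vaccine before corona virus was born"
print(link_entity(ocr, "corona virus", n=2))
```

The returned sub-sentence is what the sentiment or similarity scoring later operates on, which is exactly why nearer words end up weighing more in the role assignment.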

5.1.2 Creation of Role-dictionaries

People use framing to contextualize events by connecting with readers’ prior
knowledge, cultural narratives, and moral values. It helps to present the agents involved
in a story as heroes, villains, victims or none, so that readers can more easily anticipate
and comprehend the attitudes, beliefs, decisions, and actions of the agents portrayed.
Framing theory concludes that generally positive phrases are used to characterise
heroes, whereas negative terms are used to depict victims and villains. In general, heroes
represent bravery, great accomplishments, or other noble attributes, whereas villains
represent malicious intent, scheming, and other undesirable characteristics.
Thus, iterating through the whole OCR text, we curated three word dictionaries,
namely the hero dictionary, villain dictionary and victim dictionary, as shown in Figure
5.2. The hero dictionary contains terms of positive sentiment that would generally be
used to represent the role of heroes, while the villain dictionary and victim dictionary
contain words of negative sentiment that are generally used to represent villains and
victims respectively.

Figure 5.2: Role Dictionaries
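Their structure can be illustrated as below. The words shown are invented examples for illustration, not excerpts of the actual curated lists:

```python
# Hypothetical excerpts of the three hand-curated role dictionaries.
ROLE_DICTIONARIES = {
    "hero": ["brave", "saviour", "honest", "protects"],     # positive sentiment
    "villain": ["corrupt", "evil", "attacks", "deceives"],  # negative, aggressive
    "victim": ["suffering", "helpless", "oppressed"],       # negative, passive
}
```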


5.1 Data Pre-Processing 23

5.1.3 Similarity Dictionary

For each word in the English dictionary (the WordNet dictionary), we calculate its
similarity score (using Wu-Palmer similarity) with all the words present in the hero,
villain, and victim dictionaries, respectively, to determine its similarity with the roles of
hero, villain, and victim, and normalize those scores to create the similarity dictionary.
To determine a word's similarity score with a given role, we compare the word with
all the words in that role's dictionary and take the normalized sum of the pairwise
similarities as the score; this is done in turn for the hero, villain, and victim dictionaries.
In the similarity dictionary, for each word the first entry represents the similarity
score with the hero role, the second entry the similarity score with the villain role, and
the third entry the similarity score with the victim role, as shown in Figure 5.3.

Figure 5.3: Similarity Dictionary
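The per-role scoring can be sketched as follows. The function names are ours, and the word-pair similarity is injected so the sketch stays self-contained; in the project it would be NLTK WordNet's Wu-Palmer measure, e.g. `wn.synsets(a)[0].wup_similarity(wn.synsets(b)[0])`.

```python
def role_similarity(word, role_dict, sim):
    """Normalized sum of pairwise similarities between `word` and all
    words in one role dictionary; `sim` maps a word pair to [0, 1]."""
    if not role_dict:
        return 0.0
    return sum(sim(word, role_word) for role_word in role_dict) / len(role_dict)

def similarity_entry(word, hero_dict, villain_dict, victim_dict, sim):
    """One similarity-dictionary entry: a (hero, villain, victim) triple."""
    return tuple(role_similarity(word, d, sim)
                 for d in (hero_dict, villain_dict, victim_dict))
```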


5.2 Methods and Models 24

5.2 Methods and Models


During the course of this work, two different frameworks were experimented with
for role detection. The frameworks are described in the following subsections.

5.2.1 Framework-I

1. For each entity given in a particular meme, identify the words close to the entity
(i.e., the surrounding words) through entity-sentence linking.

2. Perform sentiment analysis to determine the polarity of these words, thus inferring
the sentiment attributed to the entity.

3. Use the sentiment polarity to role-label the entities according to the proposed
semantic classes.

After performing entity-sentence linking, we determine the sentiment score of the
words (sub-sentences) linked with an entity; we do this for all the entities mentioned in
that particular meme. To do this, we calculate the sentiment (i.e., word polarity) of each
word using a standard toolkit such as VADER-Sentiment1 (as it has a large vocabulary
of word polarities), thus obtaining a polarity for each word, which ranges between [-1, 1]
(i.e., very negative to very positive). These sentiment polarities are then summed up for
each sentence. Finally, the sentiment polarities of the sentences are normalized and then
averaged to get an overall sentiment ascribed to the entity [58].
Since heroes are associated with positive words and positive sentiment, entities with
positive sentiment attributed to them are role-labelled as hero. Similarly, villains and
victims are associated with negative words and negative sentiments, so entities with
negative sentiment attributed to them are role-labelled as villain or victim based on the
magnitude of the negativity. If the words (sub-sentences) have no polarity, they do not
glorify, vilify, or victimize any entity and are thus semantically closest to the class
"other", as described in Figure 5.4.

1 https://pypi.org/project/vaderSentiment/

Figure 5.4: Framework-I Architecture
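The role-assignment logic of Framework-I can be sketched as follows. The thresholds are illustrative choices, not the tuned project values, and the injected `polarity` function stands in for VADER's compound score, i.e. `SentimentIntensityAnalyzer().polarity_scores(s)["compound"]`.

```python
def label_by_sentiment(sub_sentences, polarity, pos=0.05, strong_neg=-0.5):
    """Average the polarity of an entity's linked sub-sentences and map
    the result to hero / villain / victim / other."""
    if not sub_sentences:
        return "other"
    avg = sum(polarity(s) for s in sub_sentences) / len(sub_sentences)
    if avg > pos:
        return "hero"          # positive sentiment -> hero
    if avg < strong_neg:
        return "villain"       # strongly negative -> villain
    if avg < -pos:
        return "victim"        # mildly negative -> victim
    return "other"             # no clear polarity -> other
```

The villain/victim split by magnitude of negativity mirrors the description above: both roles attract negative words, so only the strength of the negativity separates them.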



5.2.2 Framework-II

1. For each entity given in a particular meme, identify the words close to the entity
(i.e., the surrounding words) through entity-sentence linking.

2. Determine the resemblance of these words to the words used to describe heroes,
villains, and victims by curating word sets or dictionaries for each role.

3. Role-label the entities by comparing their similarity scores with those of hero,
villain, and victim. If the scores are zero or almost the same, assign the "other" class.

After performing entity-sentence linking, we create three dictionaries, one each for
hero, villain, and victim, containing the words or terms similar to them, respectively.
Then, using a method such as Wu-Palmer similarity2, we calculate the similarity score of
each word from the entity-sentence linking step with the hero, villain, and victim
dictionaries, which were hand-crafted by going through the whole OCR text, to create
the similarity dictionary (Figure 5.3).
The similarity score for each entity is then determined by summing the similarity
scores of all the words found in its sub-sentences. This is normalized to get the overall
similarity of a particular entity with the roles of hero, villain, victim, and other. Using
these similarity scores, we assign an entity to the role whose similarity score is the
highest [58]. For example, if an entity has the highest similarity score with the hero role,
we role-label that entity as hero, and similarly for the other roles. If the similarity scores
for the roles are almost equal or zero, we assign the entity to the class "other" in the
proposed role-assignment approach, as described in Figure 5.5.

2 https://arxiv.org/ftp/arxiv/papers/1310/1310.8059.pdf
Figure 5.5: Framework-II Architecture
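The final role decision of Framework-II can be sketched as below. The tie/zero tolerance `eps` is an illustrative choice standing in for whatever "almost the same" threshold the project used.

```python
def assign_role(sims, eps=1e-3):
    """Assign the role with the highest (hero, villain, victim) similarity;
    fall back to "other" when all scores are near zero or nearly tied."""
    roles = ("hero", "villain", "victim")
    if max(sims) <= eps or max(sims) - min(sims) <= eps:
        return "other"
    return roles[sims.index(max(sims))]
```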



5.2.3 Improved Framework-I

1. For each meme image, generate an image caption and combine the generated
caption with the initial OCR text of the respective meme.

2. For each entity given in a particular meme, identify the words close to the entity
(i.e., the surrounding words) through entity-sentence linking.

3. Perform sentiment analysis to determine the polarity of these words, thus inferring
the sentiment attributed to the entity.

4. Use the sentiment polarity to role-label the entities according to the proposed
semantic classes.

To obtain greater context from the image, we generate captions for the memes using
machine learning models such as Inception-v3 combined with an LSTM, as shown in
Figure 4.11, and use the captions to supplement the OCR text.
Then, after performing entity-sentence linking, we determine the sentiment score of
the words (sub-sentences) linked with an entity; we do this for all the entities mentioned
in that particular meme. To do this, we calculate the sentiment (i.e., word polarity) of
each word using a standard toolkit such as VADER-Sentiment3 (as it has a large
vocabulary of word polarities), thus obtaining a polarity for each word, which ranges
between [-1, 1] (i.e., very negative to very positive). These sentiment polarities are then
summed up for each sentence. Finally, the sentiment polarities of the sentences are
normalized and then averaged to get an overall sentiment ascribed to the entity [58].
Since heroes are associated with positive words and positive sentiment, entities with
positive sentiment attributed to them are role-labelled as hero. Similarly, villains and
victims are associated with negative words and negative sentiments, so entities with
negative sentiment attributed to them are role-labelled as villain or victim based on the
magnitude of the negativity. If the words (sub-sentences) have no polarity, they do not
glorify, vilify, or victimize any entity and are thus semantically closest to the class
"other", as described in Figure 5.6.

3 https://pypi.org/project/vaderSentiment/

Figure 5.6: Improved Framework-I architecture
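The only change over Framework-I is the extra captioning step; the resulting data flow can be sketched as below. The stubs are ours: `generate_caption` stands in for the Inception-v3 + LSTM captioner, `link` for entity-sentence linking, and `label` for the sentiment-based role assignment.

```python
def improved_pipeline(ocr_text, entities, generate_caption, link, label):
    """Supplement the OCR text with the generated image caption, then
    link and label every entity on the combined text."""
    combined = f"{ocr_text} {generate_caption()}".strip()
    return {entity: label(link(combined, entity)) for entity in entities}
```

Because the caption is simply appended to the OCR text, every downstream step (linking, sentiment scoring, labelling) runs unchanged on the richer input.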



5.2.4 Improved Framework-II

1. For each meme image, generate an image caption and combine the generated
caption with the initial OCR text of the respective meme.

2. For each entity given in a particular meme, identify the words close to the entity
(i.e., the surrounding words) through entity-sentence linking.

3. Determine the resemblance of these words to the words used to describe heroes,
villains, and victims by curating word sets or dictionaries for each role.

4. Role-label the entities by comparing their similarity scores with those of hero,
villain, and victim. If the scores are zero or almost the same, assign the "other" class.

To obtain greater context from the image, we generate captions for the memes using
machine learning models such as Inception-v3 combined with an LSTM, as shown in
Figure 4.11, and use the captions to supplement the OCR text.
Then, after performing entity-sentence linking, we create three dictionaries, one each
for hero, villain, and victim, containing the words or terms similar to them, respectively.
Then, using a method such as Wu-Palmer similarity4, we calculate the similarity score of
each word from the entity-sentence linking step with the hero, villain, and victim
dictionaries, which were hand-crafted by going through the whole OCR text, to create
the similarity dictionary (Figure 5.3).
The similarity score for each entity is then determined by summing the similarity
scores of all the words found in its sub-sentences. This is normalized to get the overall
similarity of a particular entity with the roles of hero, villain, victim, and other. Using
these similarity scores, we assign an entity to the role whose similarity score is the
highest [58]. For example, if an entity has the highest similarity score with the hero role,
we role-label that entity as hero, and similarly for the other roles. If the similarity scores
for the roles are almost equal or zero, we assign the entity to the class "other" in the
proposed role-assignment approach, as described in Figure 5.7.

4 https://arxiv.org/ftp/arxiv/papers/1310/1310.8059.pdf

Figure 5.7: Improved Framework-II architecture
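The score aggregation shared by both similarity-based frameworks can be sketched as below. The similarity-dictionary entries used here are invented for illustration; each entry is a (hero, villain, victim) triple as described in Section 5.1.3.

```python
def entity_role_scores(sub_sentences, similarity_dict):
    """Sum each word's (hero, villain, victim) similarity triple over all
    of an entity's sub-sentences, then normalize to an overall triple."""
    totals = [0.0, 0.0, 0.0]
    for sentence in sub_sentences:
        for word in sentence.lower().split():
            for i, score in enumerate(similarity_dict.get(word, (0.0, 0.0, 0.0))):
                totals[i] += score
    norm = sum(totals)
    return tuple(t / norm for t in totals) if norm else (0.0, 0.0, 0.0)
```

An all-zero triple (no dictionary word matched) naturally falls through to the "other" class in the subsequent role assignment.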


Chapter 6

Results

For the competition, teams were ranked based on the macro F1-score across all the
classes. The proposed method and model secured the eighth position in the competition
for the task of dissecting harmful memes for semantic role labelling of entities. Table
6.2 shows the rankings of various teams, with the performance of the proposed system
indicated in bold. The output for a meme from the test sample is shown in Figure 6.1;
the figure contains the role labels generated by both Framework-I and Framework-II.

Figure 6.1: Output of Framework-I and Framework-II

Model            Macro Precision   Macro Recall   Macro F1

Framework - I    25.577            23.799         23.855
Framework - II   25.577            23.799         23.855

Table 6.1: Performance Metrics


The model performs well in the role-labeling task. However, in some cases, the
model underperforms in identifying the categories due to the difficulty of capturing
some of the attributes or traits related to the roles. As a result, the overall system's
macro F1-score is low at 23.855, as shown in Table 6.1. In addition, the ensembling of
multiple NLP sub-tasks has also contributed to the decrease in the system's F1-score.
The system's performance can be further improved by modeling those NLP sub-tasks
with better parameters, which could potentially increase the score.

SL. no Username / Team Name F1 Score

1 Shiroe 58.671
2 jayeshbanukoti 56.005
3 c1pher 55.240
4 zhouziming 54.707
5 smontariol 48.483
6 zjl123001 46.177
7 amanpriyanshu 31.943
8 Team IIITDWD (fharookshaik) 23.855
9 rabindra.nath 23.717

Table 6.2: Top performing teams in Shared Task@Constraint 2022


Chapter 7

Conclusion and Future Scope

The current system implementations use NLP techniques such as entity recognition,
sentiment analysis, and word sets and dictionaries, along with some machine learning,
all of which have shown promising results in the role-labeling task. Across all classes,
the existing system implementation produced a consistent F1-score. However, as the
model is based on simple proximity measures, it has issues when dealing with OCR text
that contains composite grammatical structures such as indirect speech, passive voice,
etc. In this experiment, the n-word window size used for data processing was n=3. As a
result, there is potential for various future changes to increase the system's performance.
We also aim to implement image feature recognition on memes with the goal of
recognising facial traits or emotions that can be utilised to determine a meme's
sentiment in cases where OCR is unable to do so.
Further, in future experiments and add-ons, we plan to leverage machine learning
models such as SVMs to discover distinct sentiment-polarity boundaries for the various
sub-tasks, thereby enhancing the sub-tasks and improving the system's role-labeling
performance.

References

[1] M. L. Boon, "Augmenting media literacy with automatic characterization of news along pragmatic dimensions," ACM Conference on Computer Supported Cooperative Work and Social Computing, 2017.

[2] P. Fortuna and S. Nunes, "A survey on automatic detection of hate speech in text," ACM Computing Surveys (CSUR), 2018.

[3] T. Davidson, D. Warmsley, M. Macy, and I. Weber, "Automated hate speech detection and the problem of offensive language," Proceedings of the International AAAI Conference on Web and Social Media, 2017.

[4] J. Goodwin, J. M. Jasper, and F. Polletta, "Passionate politics: Emotions and social movements," University of Chicago Press, 2009.

[5] S. Biradar, S. Saumya, and A. Chauhan, “Combating the infodemic: Covid-19 in-
duced fake news recognition in social media networks,” Complex & Intelligent Sys-
tems, pp. 1–13, 2022.

[6] S. Stieglitz and L. Dang-Xuan, “Emotions and information diffusion in social me-
dia—sentiment of microblogs and sharing behavior,” Journal of management infor-
mation systems, 2013.

[7] S. Biradar, S. Saumya, and A. Chauhan, “Hate or non-hate: Translation based hate
speech identification in code-mixed hinglish data set,” in 2021 IEEE International
Conference on Big Data (Big Data). IEEE, 2021, pp. 2470–2475.

[8] D. A. Scheufele, "Framing as a theory of media effects," Journal of Communication, 1999.

[9] M. C. Green, “Transportation into narrative worlds: The role of prior knowledge and
perceived realism,” Discourse processes.


[10] L. B. Diego Gomez-Zara, Miriam Boon, “Detection of roles in news articles us-
ing natural language techniques,” 23rd International Conference on Intelligent User
Interfaces, 2018.

[11] J. M. Struß, M. Siegel, J. Ruppenhofer, M. Wiegand, M. Klenner et al., "Overview of GermEval task 2, 2019 shared task on the identification of offensive language," 2019.

[12] S. Sharma, T. Suresh, A. Kulkarni, H. Mathur, P. Nakov, M. S. Akhtar, and T. Chakraborty, "Findings of the CONSTRAINT 2022 shared task on detecting the hero, the villain, and the victim in memes," in Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 1–11. [Online]. Available: https://aclanthology.org/2022.constraint-1.1

[13] M. Hardalov, A. Arora, P. Nakov, and I. Augenstein, "A survey on stance detection for mis- and disinformation identification," arXiv preprint arXiv:2103.00242, 2021.

[14] D. M. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J. Metzger, B. Nyhan, G. Pennycook, D. Rothschild et al., "The science of fake news," Science, vol. 359, no. 6380, pp. 1094–1096, 2018.

[15] S. Vosoughi, D. Roy, and S. Aral, “The spread of true and false news online,” Sci-
ence, vol. 359, no. 6380, pp. 1146–1151, 2018.

[16] S. MacAvaney, H.-R. Yao, E. Yang, K. Russell, N. Goharian, and O. Frieder, “Hate
speech detection: Challenges and solutions,” PloS one, vol. 14, no. 8, p. e0221152,
2019.

[17] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar, "Predicting the type and target of offensive posts in social media," arXiv preprint arXiv:1902.09666, 2019.

[18] F. Alam, S. Cresci, T. Chakraborty, F. Silvestri, D. Dimitrov, G. D. S. Martino, S. Shaar, H. Firooz, and P. Nakov, "A survey on multimodal disinformation detection," arXiv preprint arXiv:2103.12541, 2021.

[19] M. L. Boon, "Augmenting media literacy with automatic characterization of news along pragmatic dimensions," in Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 2017, pp. 49–52.

[20] D. Dor, "On newspaper headlines as relevance optimizers," Journal of Pragmatics, 2003.

[21] Y. Zhou and Z. Chen, “Multimodal learning for hateful memes detection,” 2021
IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp.
1–6, 2021.

[22] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, and D. Testuggine, "The hateful memes challenge: Detecting hate speech in multimodal memes," Advances in Neural Information Processing Systems, vol. 33, pp. 2611–2624, 2020.

[23] C. Sharma, D. Bhageria, W. Scott, S. Pykl, A. Das, T. Chakraborty, V. Pulabaigari, and B. Gamback, "SemEval-2020 task 8: Memotion analysis – the visuo-lingual metaphor!" arXiv preprint arXiv:2008.03781, 2020.

[24] S. Zidani and R. Moran, “Memes and the spread of misinformation: Establishing
the importance of media literacy in the era of information disorder,” Teaching Media
Quarterly, vol. 9, no. 1, 2021.

[25] S. Suryawanshi, B. R. Chakravarthi, M. Arcan, and P. Buitelaar, "Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text," in Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, 2020, pp. 32–41.

[26] D. Dimitrov, B. B. Ali, S. Shaar, F. Alam, F. Silvestri, H. Firooz, P. Nakov, and G. D. S. Martino, "Detecting propaganda techniques in memes," 2021. [Online]. Available: https://arxiv.org/abs/2109.08013

[27] S. Pramanick, D. Dimitrov, R. Mukherjee, S. Sharma, M. S. Akhtar, P. Nakov, and T. Chakraborty, "Detecting harmful memes and their targets," arXiv, 2021.

[28] H. Fujita and A. Selamat, "Hate crime on twitter: Aspect-based sentiment analysis
approach,” in Advancing Technology Industrialization Through Intelligent Software
Methodologies, Tools and Techniques: Proceedings of the 18th International Con-
ference on New Trends in Intelligent Software Methodologies, Tools and Techniques
(SoMeT_19), vol. 318. IOS Press, 2019, p. 284.

[29] L. Shang, C. Youn, Y. Zha, Y. Zhang, and D. Wang, "KnowMeme: A knowledge-enriched graph neural network solution to offensive meme detection," in 2021 IEEE 17th International Conference on eScience (eScience). IEEE, 2021, pp. 186–195.

[30] S. Pramanick, S. Sharma, D. Dimitrov, M. S. Akhtar, P. Nakov, and T. Chakraborty, "MOMENTA: A multimodal framework for detecting harmful memes and their targets," arXiv preprint arXiv:2109.05184, 2021.

[31] L. Kun, J. Bankoti, and D. Kiskovski, “Logically at the constraint 2022: Multimodal
role labelling,” in Proceedings of the Workshop on Combating Online Hostile
Posts in Regional Languages during Emergency Situations. Dublin, Ireland:
Association for Computational Linguistics, May 2022, pp. 24–34. [Online].
Available: https://aclanthology.org/2022.constraint-1.4

[32] P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disen-
tangled attention,” 2021.

[33] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettle-
moyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,”
2019.

[34] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without con-
volution or region supervision,” 2021.

[35] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional
neural networks,” 2020.

[36] P. Singh, A. Maladry, and E. Lefever, "Combining language models and linguistic information to label entities in memes," in Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 35–42. [Online]. Available: https://aclanthology.org/2022.constraint-1.5

[37] Z. Zhou, H. Zhao, J. Dong, J. Gao, and X. Liu, "DD-TIG at constraint@ACL2022: Multimodal understanding and reasoning for role labeling of entities in hateful memes," in Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 12–18. [Online]. Available: https://aclanthology.org/2022.constraint-1.2

[38] "Villains, victims and heroes: Melodrama, media, and September 11," Journal of Communication, vol. 55, 2005.

[39] C. Kontio, K. Gradin et al., "An exploration of satirical internet memes' effect on brand image," Linnaeus University.

[40] S. Sharma, T. Suresh, A. Kulkarni, H. Mathur, P. Nakov, M. S. Akhtar, and T. Chakraborty, "Findings of the CONSTRAINT 2022 shared task on detecting the hero, the villain, and the victim in memes," in Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations - CONSTRAINT 2022, collocated with ACL 2022, 2022.

[41] G. Harika, N. Anvitha, J. Sravani, and K. Sowjanya, "Building an image captioning system using CNNs and LSTMs," Int. Res. J. Mod. Eng. Technol. Sci., vol. 2, no. 6, 2020.

[42] I. Feinerer and K. Hornik, wordnet: WordNet Interface, 2020, r package version
0.1-15. [Online]. Available: https://CRAN.R-project.org/package=wordnet

[43] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, "A comprehensive survey of deep learning for image captioning," ACM Comput. Surv., vol. 51, no. 6, Feb 2019. [Online]. Available: https://doi.org/10.1145/3295748

[44] C. Wang, H. Yang, C. Bartz, and C. Meinel, "Image captioning with deep bidirectional LSTMs," in Proceedings of the 24th ACM International Conference on Multimedia, ser. MM '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 988–997. [Online]. Available: https://doi.org/10.1145/2964284.2964299

[45] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: Lessons learned
from the 2015 mscoco image captioning challenge,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 39, no. 4, pp. 652–663, 2017.

[46] J. Aneja, A. Deshpande, and A. G. Schwing, "Convolutional image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[47] J.-Y. Pan, H.-J. Yang, P. Duygulu, and C. Faloutsos, “Automatic image captioning,”
in 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE
Cat. No.04TH8763), vol. 3, 2004, pp. 1987–1990 Vol.3.

[48] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, "Image captioning: Transforming objects into words," in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/680390c55bbd9ce416d1d69a9ab4760d-Paper.pdf

[49] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," CoRR, vol. abs/1512.00567, 2015. [Online]. Available: http://arxiv.org/abs/1512.00567

[50] ——, “Rethinking the inception architecture for computer vision,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June
2016.

[51] M. Sundermeyer, R. Schlüter, and H. Ney, “Lstm neural networks for language mod-
eling,” in Thirteenth annual conference of the international speech communication
association, 2012.

[52] Y. Yu, X. Si, C. Hu, and J. Zhang, “A Review of Recurrent Neural Networks:
LSTM Cells and Network Architectures,” Neural Computation, vol. 31, no. 7, pp.
1235–1270, 07 2019. [Online]. Available: https://doi.org/10.1162/neco_a_01199

[53] S. Wang and J. Jiang, “Learning natural language inference with LSTM,” CoRR,
vol. abs/1512.08849, 2015. [Online]. Available: http://arxiv.org/abs/1512.08849

[54] J. A. Alzubi, R. Jain, P. Nagrath, S. Satapathy, S. Taneja, and P. Gupta, “Deep image
captioning using an ensemble of cnn and lstm based deep neural networks,” Journal
of Intelligent & Fuzzy Systems, vol. 40, no. 4, pp. 5761–5769, 2021.

[55] Hartatik, H. Al Fatta, and U. Fajar, “Captioning image using convolutional neural
network (cnn) and long-short term memory (lstm),” in 2019 International Seminar
on Research of Information Technology and Intelligent Systems (ISRITI), 2019, pp.
263–268.

[56] P. Shah, V. Bakrola, and S. Pati, “Image captioning using deep neural architectures,”
in 2017 International Conference on Innovations in Information, Embedded and
Communication Systems (ICIIECS), 2017, pp. 1–4.

[57] S. Bird, E. Klein, and E. Loper, "Natural language processing with Python: Analyzing text with the natural language toolkit," O'Reilly Media, Inc., 2009.

[58] S. Fharook, S. Sufyan Ahmed, G. Rithika, S. S. Budde, S. Saumya, and S. Biradar, "Are you a hero or a villain? A semantic role labelling approach for detecting harmful memes," in Proceedings of the Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situations. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 19–23. [Online]. Available: https://aclanthology.org/2022.constraint-1.3
