
Innovative Computing Review (ICR)

Volume 1 Issue 1, Summer 2021


ISSN(P): 2791-0024 ISSN(E): 2791-0032
Journal DOI: https://doi.org/10.32350/icr
Issue DOI: https://doi.org/10.32350/icr.0101
Homepage: https://journals.umt.edu.pk/index.php/icr

Article: Hate Speech Detection in Social Media Surveillance: A Review of Related Literature

Author(s): Usman Ahmed1, Rahmet Bibi2, Obaid Ullah2, Rahat Bano2

Affiliation: 1 Foundation University Islamabad, Pakistan
2 Superior University Lahore, Pakistan

Article History: Received: March 14, 2021
Revised: May 19, 2021
Accepted: June 30, 2021
Available Online: June 30, 2021

Citation: U, Ahmed, B, Rahmet, U, Obaid, and B, Rahat, “Hate speech detection in social media surveillance: A review of related literature”, Innova Comput Rev, vol. 1, no. 1, pp. 01–11, 2021. https://doi.org/10.32350/icr.0101.01

Copyright Information: This article is open access and is distributed under the terms of Creative Commons Attribution 4.0 International License

A publication of the

School of Systems and Technology

University of Management and Technology, Lahore, Pakistan


Hate Speech Detection in Social Media Surveillance:
A Review of Related Literature
Usman Ahmed1, Rahmet Bibi2, Obaid Ullah2, Rahat Bano2

1 Usman Ahmed is with Foundation University Islamabad, Pakistan. Usman Ahmed is the corresponding author and available at usman.ahmed@fui.edu.pk
2 Rahmet Bibi, Obaid Ullah, and Rahat Bano are with Superior University Lahore, Pakistan.

ABSTRACT: Social media surveillance is a requirement for governments and intelligence agencies around the world to detect and prevent hate crimes. The dynamic and unstructured nature of the textual content available on social media platforms makes it very complex to extract hate-related speech patterns from this content. It also creates ambiguities in the data and, therefore, data mining techniques become difficult to apply in this scenario. Several alternative techniques were adopted by different researchers in the past to cope with this problem and to capture and analyze such unstructured text for the purpose of hate speech detection. In this paper, we reviewed, categorized and presented the state of the art of these techniques, which were divided into three categories, namely text mining, sentiment analysis and semantics. The challenges in the application of the existing techniques were also discussed and these can be taken up as future directions.

INDEX TERMS: data mining, hate speech, NLP, semantics, sentiment analysis, social media, surveillance, text mining

I. INTRODUCTION
Today, social media platforms have made it possible for people around the world to connect and communicate with each other easily. However, they have also made it easy for criminals to perform criminal activities using these platforms. Governments and intelligence agencies need to detect and prevent such criminal activities using data mining and other similar techniques on the data collected from these sources. However, there has been an extensive debate about the amount of private information that can be gathered for surveillance by governments and agencies [1].

Social media platforms such as Facebook, Twitter and LinkedIn provide extensive opportunities for the masses to mutually connect and engage in order to share knowledge and to communicate with each other. Textual data communicated over social media platforms in the form of comments, messages, and posts remains mostly
unstructured because people are more likely to use language without proper spelling and grammar. This creates ambiguity in sentences and words and, consequently, semantic and syntactic problems while interpreting this data and uncovering logical patterns from it [2].

The problems concerning the ambiguities in the data can be resolved using textual mining, or simply text mining. Uncovering patterns from unstructured data is a challenge for data mining, because data mining involves searching and analyzing the structured data found in a database to find the patterns of interest. Text mining, on the other hand, is a blend of techniques including natural language processing (NLP), text analysis, and information retrieval. It is used to find logical patterns in unstructured data [3]. Text mining is more complex than data mining because of the nature of natural language. Moreover, social media is replete with unstructured textual content in its posts and comments.

Several approaches to text mining are found in the literature. However, there still exist some major challenges related to text mining in social media, including the fact that the data is much larger and dynamic in nature. Moreover, there are privacy concerns. The use of semantic approaches for coping with such problems in social media is also a recent trend. It is the approach of giving meaning to the terms and concepts to find relationships among them that uncover patterns of interest. Ontologies are explicit and bring forth the shared conceptualization of real-world scenarios [4]. They provide machine-processable semantics and are therefore used in different domains such as software engineering, medical science, and others to automate the different tasks involved. Ontologies are also used in text mining in social media analysis [5]. Several approaches employ ontologies to infer racial, religious or political hate speech from the content shared on social media platforms in the form of posts and comments. The current paper is organized as a literature review of the previous techniques and divides them into the domains of text mining, sentiment analysis and semantics, followed by the state of the art of these techniques. Lastly, there are the conclusion and future directions.

II. LITERATURE REVIEW
Social networking platforms such as Facebook and Twitter are used to remove barriers among communities around the world and to share and communicate knowledge and thoughts [6]. Unfortunately, these platforms have also enabled people to commit insults, cyberbullying, and various incitements through racial, religious and political hate speech that lead to social anarchy [7].

Hate speech as a term was used by Warner and Hirschberg [8] and it is defined as an intentional act of insult using abusive or hostile messages. Several other terms, including cyberbullying, are used in the literature for the same purpose. The phenomenon of hate speech has negative consequences for society and remains punishable around the world by governments using different laws [9], [10]. It is thus necessary for any government to detect and respond to this crime through surveillance. Here, we are concerned with surveillance involving textual data from various social media websites.

Several techniques are used for analyzing unstructured content on social media. Social media platforms are loaded with huge amounts of textual data. The communication of the majority of social media users takes the form of unstructured data, not using the exact words and proper sentence structure. Although data mining techniques have been used to extract huge amounts of structured data, these techniques cannot work well with unstructured and dynamic textual data. Therefore, text mining and natural language processing techniques have become popular for analyzing unstructured textual data. Text mining is an extension of data mining but it is more complex and different because it deals with natural language. It creates meaningful data from unstructured data patterns [2].

Text mining or text analytics can be divided into four stages. Firstly, it involves capturing the data from social media platforms and pre-processing it using stemming, tokenization and stop-word removal techniques. The extracted and cleaned data is then represented through models to derive knowledge from it. For the representation and modeling of the data, the BOW (Bag of Words) technique is mostly employed, as described in [11]. It transforms the data collected from these mediums into numeric vector-based representations onto which algebraic operations can be applied. However, this technique has a shortcoming when it is used to analyze the relationship between various isolated pieces of information. This semantic gap can only be filled if we use a semantic knowledge base.

Another text mining approach [12] extracts intellectual data from Facebook. It uses the Facebook API to extract the data about several attributes of users, including their age, profile, comments, and timeline posts, in order to transform them into different representations using data mining techniques. The aim is to study the users themselves and their activities and to create their individual profiles. This approach is also concerned with the inter-user comparison of their activities and with behavior prediction.

Yet another technique [13] uses Twitter tweets to identify medication abuse related posts and to monitor them.
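
The pre-processing and Bag of Words (BOW) representation described earlier in this section can be illustrated with a minimal sketch. The code is not taken from [11] or from any of the reviewed works. The stop-word list, the suffix-stripping stemmer (a crude stand-in for a real stemming algorithm), the function names and the two sample posts are all assumptions chosen for illustration. The sketch tokenizes each post, removes stop words, stems the remaining tokens, builds term-count vectors, and applies one algebraic operation (a dot product) to them.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (real systems use larger ones).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes (assumed stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def build_vocabulary(docs):
    """Map each distinct pre-processed term to a column index (sorted order)."""
    terms = sorted({term for doc in docs for term in preprocess(doc)})
    return {term: index for index, term in enumerate(terms)}


def bow_vector(text, vocabulary):
    """Return the term-count vector of one document over the vocabulary."""
    counts = Counter(preprocess(text))
    return [counts[term] for term in sorted(vocabulary, key=vocabulary.get)]


def dot(u, v):
    """Dot product of two equal-length vectors (a simple similarity score)."""
    return sum(a * b for a, b in zip(u, v))


# Two made-up posts used only to exercise the pipeline.
posts = [
    "The users are posting hateful comments",
    "Comments and comments of the user",
]
vocabulary = build_vocabulary(posts)
vectors = [bow_vector(post, vocabulary) for post in posts]
similarity = dot(vectors[0], vectors[1])
```

Both posts are mapped to vectors over the same four-term vocabulary, so they can be compared numerically. The representation records only term counts, not word order or meaning, which is the semantic gap that this section attributes to BOW.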

The technique in [13] identifies and organizes the tweets that mention these abuse related medications and then divides them into three categories of drug or medication abuse. The authors also developed an automated supervised classification technique. It distinguishes the posts that signal the presence of drug abuse from those that do not. NLP and the WordNet semantic ontology are also employed to analyze the posts. For the automated supervised classification, several available algorithms such as Support Vector Machines (SVM) and others are used. The classifiers built from these algorithms are then combined using stacking to make the final decision based on their individual predictions.

A semantic approach by Mika [14] discusses folksonomy as a dynamic type of ontology related to a particular community-based domain (social network). Folksonomy allows the individual user to express shared concepts using their own choice of keywords. The author argues that the very basic and simple concept of ontology restricts folksonomy from being used across the domain, while problems may arise with the temporal extension of knowledge and the evolution of the social community with its changing members or through change in their commitments. This can invalidate the knowledge contained in that particular ontology. The author presented a tripartite model consisting of the actor, concept, and instance for the purpose of folksonomy. This idea was inspired by the social tagging mechanism where the user uses different tags to communicate their concerns. In this way, the particular social media context is focused on the construction of such an ontology. For instance, Flickr is used to share photos, CiteULike provides tagging of scientific papers, and so on. Although their scope is limited to a single website, the idea of tagging can be reused in other similar applications.

Maynard et al. [15] presented a real-time semantic framework employing linked open data as a knowledge base to use in their semantic searches. They used the GATE-based open-source framework for searching and aggregation to analyze the textual content in Twitter tweets, highlighting them in a generic or domain independent environment using specific keywords. They employed this framework in the domain of political science for the study of change in the political environment and it led them to successfully predict UK general elections. However, its main focus is not a specific domain; rather, it presents a more generic form of analysis. Also, it is limited to Twitter only.

In [16], the authors used neural language models to detect hate speech from comments. They moved away from the BOW approach as it may not capture all hate speech if the offender
changes the offensive words yet remains clear in meaning. Therefore, they used the low dimensionality approach of CBOW (Continuous Bag of Words) for the neural language model to become more effective and efficient in hate speech detection. In this approach, words are represented in a common vector space and the model tries to predict a central word or comment.

There is another technique known as sentiment analysis that attempts to detect the sentiment or emotions of the users from the text in order to determine their attitude or opinion regarding a certain topic. This may be thought of as similar to text analysis. However, it is mostly used to determine if some expression or sentence is positive, negative or neutral. It is also used to rate a product or service on the basis of its reviews. One such approach used for analyzing social media content employed unsupervised lexicon based classification [17]. The unsupervised classifier does not require any training. It produces both a positive and a negative rating for a given expression, because both positive and negative emotions may be present in a text. The approach then makes a ternary prediction from the estimates of both positive and negative emotions, and whichever value outweighs the other becomes the answer in the final prediction. The authors demonstrated that their approach produces better results as compared to the available supervised or machine learning solutions.

Another recent empirical study made use of semantic feature representations to better understand the context of user expression in order to identify the hate speech intent of the user [18]. Unlike the previous approaches, this approach used an external knowledge base of semantics in feature representations. The semantic features employed included Hatebase features and FrameNet features. Hatebase is a multilingual hate speech knowledge base available online. The approach used several feature vectors (H_x) such as hateful meaning, non-hateful meaning, offensiveness, and unambiguousness, and averaged such vectors in order to generate knowledge base features of a given social media post. FrameNet is another linguistic knowledge base that provides meanings under different semantic frame categories. Each post is processed through a frame semantic parser and a vocabulary is formulated. This vocabulary is then used to create the vector for each post representing the count of each frame in a given post. Although the FrameNet approach has improved hate speech detection, it is presently limited to the English language only and it is also partially affected by words with multiple connotations.

The state of the art of various data mining techniques is presented below in Table 1 and Table 2. We
divided all such techniques previously used for the analysis of text in social media into three categories, namely text mining, sentiment analysis, and semantics. Certain parameters including main idea, search, discovery, prediction, and methodology validation were chosen to draw a comparison among different techniques.

TABLE 1
CATEGORIZATION OF TECHNIQUES USED FOR TEXT ANALYSIS IN SOCIAL MEDIA
Author           | Year | Technique                             | Category           | Main Idea                                     | Social Network
Rahman           | 2012 | Systematic Mining Model               | Text Mining        | User Attributes                               | Facebook
Djuric et al.    | 2015 | Continuous Bag of Words (CBOW)        | Text Mining        | Low Dimensional Text Embeddings               | Generic
Sarker et al.    | 2016 | Automated Supervised Classification   | Text Mining        | Combining Classifiers                         | Twitter
Paltoglou et al. | 2012 | Unsupervised Lexicon Classification   | Sentiment Analysis | Linguistics Functions Classifier              | Twitter, MySpace, Digg
Mika             | 2007 | Folksonomy, tripartite semantic model | Semantics          | Social Tagging                                | Generic
Maynard et al.   | 2017 | GATE semantic framework               | Semantics          | Keyword Search                                | Twitter
Senarath et al.  | 2020 | Semantic features                     | Semantics          | Declarative Knowledge Based Semantic Features | Twitter

Table 1 shows the categories in which these works fall and also depicts the platform on which each work is carried out. The technique and main idea columns show the adopted approach.

Table 2 shows the comparison among the parameters we have specified, that is, search that corresponds to the searching of words and phrases from the social media and

TABLE 2
COMPARISON OF TECHNIQUES W.R.T SELECTED PARAMETERS
Technique Category Search Discovery Prediction Validation
To a
Systematic Mining To a certain
Text Mining certain Yes No
Model extent
extent
Continuous Bag of To a certain
Text Mining Yes Yes Yes
Words (CBOW) extent
Automated
Supervised Text Mining Yes Yes Yes No
Classification
Unsupervised
Sentiment To a certain To a certain
Lexicon Yes Yes
Analysis extent extent
Classification
Folksonomy,
tripartite semantic Semantics Yes Yes No Yes
model
GATE semantic To a certain
Semantics Yes Yes Yes
framework extent
Semantic
Semantics Yes Yes No Yes
features

discovery that corresponds to the identification of the true intended meaning of the text as hate speech. Another parameter is prediction, which refers to the future prediction of such actions using the given text or predicting other content or intentions of the user by analyzing their content. Finally, the last column of the table indicates the validation status of the technique.

These studies provide valuable insights into the approaches and strategies used in the past for natural text analysis of social media content. The works mentioned here also correspond to approaches and techniques adopted by several other researchers. It has been noted in the current study that most approaches use some keywords related to a specific domain, for instance hate speech in this scenario, as a targeted search across social media posts. This approach is mostly observed in the text mining technique. Some researchers [12] have adopted the approach of profiling the activities of individual users to detect hate speech. Others [16] tried to model the captured data through graphs and other representations, so as to apply mathematical operations to detect or
predict hate speech using partial content. Another interesting but complex approach is to combine the results of multiple classifiers to give a final one as an automated yet supervised classification. The idea of detecting sentiments and emotions as in [17] can be very useful in hate speech detection. However, what if the attacker uses sarcastic language that may not contain abuse related words but still attacks some group or individual using sarcasm? This problem can only be dealt with if we analyze the relationship of words and sentences as in ontologies. Ontologies have also been used as semantic approaches for the analysis of textual content.

In the field of semantics, most approaches by different researchers use linked open data for identifying hate speech and analyzing the text from social media as in [15], [19]. Although this approach is partially helpful, such data can only provide us with a structured knowledge base that might not be successful in most of the cases where the data obtained from the posts and comments is unstructured, yet incorporates hate speech and abusive or slang language. An ontology can be built that may represent these unstructured terminologies in a structured and semantic manner as a knowledge base that can reuse structured concepts from linked open data as well. Currently, there is no generic ontology on hate speech.

III. CONCLUSION
Hate speech on social media has been the cause of several catastrophes and social anarchy in the past. The powerful impact of social media and our reliance on it have increased and so have the chances of hate speech. Textual surveillance by governments is therefore necessary to detect and prevent hate crimes. Most previous approaches in data mining cannot work well with the ambiguous and unstructured data emerging from social media platforms. Several techniques have been used to solve this problem.

We have presented in this paper a review of the various types of techniques used in the effort to detect and analyze textual content on social media platforms. These techniques can be divided into text mining, sentiment analysis, and semantics. Each has its own advantages and shortcomings. We argued that analyzing responses or comments can provide insights into the presence and severity of hate-related social media posts. The use of semantic techniques has proven to be more efficient. However, there is still a need to explore the features of semantic web technologies for the detection of hate speech in social media, in particular to cope with the problems of multiple interpretations and sarcasm. The state of the art presented in this paper can help researchers to identify problems and present better solutions in the future.


REFERENCES
[1] D. Roark, “The end of privacy for the populace, the person of interest and the persecuted,” Health and Technology, vol. 7, no. 4, pp. 501–517, 2017.
[2] R. Irfan, et al., “A survey on text mining in social networks,” The Knowledge Engineering Review, vol. 30, no. 2, pp. 157–170, 2015. https://doi.org/10.1017/S0269888914000277
[3] R. Feldman, and J. Sanger, The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, 2007.
[4] N. Guarino, Formal ontology in information systems. IOS Press, 1998.
[5] A. Hotho, R. Jäschke, K. Lerman, “Mining social semantics on the social web,” Semantic Web, vol. 8, no. 5, pp. 623–624, 2017. http://dx.doi.org/10.3233/SW-170272
[6] L. Sorensen, “User managed trust in social networking - Comparing Facebook, MySpace and Linkedin,” In: 2009 1st International Conference on Wireless Communication, Vehicular Technology, Information Theory and Aerospace Electronic Systems Technology, 2009, pp. 427–431.
[7] F. Del Vigna, A. Cimino, F. Dell’Orletta, M. Petrocchi, M. Tesconi, “Hate me, hate me not: Hate speech detection on Facebook,” In: Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), 2017, pp. 86–95.
[8] W. Warner, and J. Hirschberg, “Detecting Hate Speech on the World Wide Web,” In: Proceedings of the second workshop on language in social media, Association for Computational Linguistics, Stroudsburg, PA, USA, 2012, pp. 19–26.
[9] B. Parekh, “Hate Speech,” Public Policy Research, vol. 12, no. 4, pp. 213–223, 2006. https://doi.org/10.1111/j.1070-535.2005.00405.x
[10] R. M. Simpson, “Dignity, harm, and hate speech,” Law and Philosophy, vol. 32, pp. 701–728, 2013. https://doi.org/10.1007/s10982-012-9164-z
[11] X. Hu, and H. Liu, “Text analytics in social media,” In: Mining text data, C. C. Aggarwal, and C. Zhai, eds. Springer US, Boston, MA, 2012, pp. 385–414. https://link.springer.com/chapter/10.1007/978-1-4614-3223-4_12
[12] M. M. Rahman, “Mining social data to extract intellectual knowledge,” International Journal of Intelligent Systems and Applications, vol. 4, no. 10, 2012. https://doi.org/10.5815/ijisa.2012.10.02
[13] A. Sarker, et al., “Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter,” Drug Safety, vol. 39, no. 3, pp. 231–240, 2016. https://doi.org/10.1007/s40264-015-0379-4
[14] P. Mika, “Ontologies are us: A unified model of social networks and semantics,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 5, no. 1, pp. 5–1, 2007. https://doi.org/10.1016/j.websem.2006.11.002
[15] D. Maynard, I. Roberts, M. A. Greenwood, D. Rout, K. Bontcheva, “A framework for real-time semantic social media analysis,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 44, pp. 75–88, 2017. https://doi.org/10.1016/j.wasman.2017.08.009
[16] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, “Hate speech detection with comment embeddings,” In: Proceedings of the 24th International Conference on World Wide Web, ACM, New York, NY, USA, 2015, pp. 29–30. https://doi.org/10.1145/2740908.2742760
[17] G. Paltoglou, and M. Thelwall, “Twitter, MySpace, and Digg: unsupervised sentiment analysis in social media,” ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 4, pp. 1–66, 2012. https://doi.org/10.1145/2337542.2337551
[18] Y. Senarath, and H. Purohit, “Evaluating Semantic feature representations to efficiently detect hate intent on social media,” In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), Irvine, CA, USA, 2020, pp. 199–202. https://doi.org/10.1109/ICSC.2020.00041
[19] F. Sahito, A. Latif, W. Slany, “Weaving Twitter stream into Linked Data a proof of concept framework,” In: 2011 7th International Conference on Emerging Technologies, Islamabad, Pakistan, 2011, pp. 1–6. https://doi.org/10.1109/ICET.2011.6048497
