Affiliation: 1 Foundation University Islamabad, Pakistan; 2 Superior University Lahore, Pakistan
Copyright Information: This article is open access and is distributed under the terms of Creative Commons Attribution 4.0 International License
A publication of the
1 Usman Ahmed is with Foundation University Islamabad, Pakistan. Usman Ahmed is the corresponding author and is available at usman.ahmed@fui.edu.pk
2 Rahmet Bibi, Obaid Ullah, and Rahat Bano are with Superior University Lahore, Pakistan.
unstructured because people are more likely to use language without proper spelling and grammar. This creates ambiguity in words and sentences and, consequently, semantic and syntactic problems when interpreting the data and uncovering logical patterns in it [2].

The problems caused by ambiguity in the data can be resolved using textual mining, or simply text mining. Uncovering patterns in unstructured data is a challenge for data mining, because data mining involves searching and analyzing the structured data found in a database to find the patterns of interest. Text mining, on the other hand, is a blend of techniques including natural language processing (NLP), text analysis, and information retrieval. It is used to find logical patterns in unstructured data [3]. Text mining is more complex than data mining because of the nature of natural language. Moreover, social media is replete with unstructured textual content in its posts and comments.

Several approaches to text mining are found in the literature. However, some major challenges remain for text mining in social media, including the fact that the data is much larger and dynamic in nature. There are also privacy concerns. The use of semantic approaches to cope with such problems in social media is a recent trend. A semantic approach gives meaning to terms and concepts in order to find relationships among them that uncover patterns of interest. Ontologies are explicit and bring forth the shared conceptualization of real-world scenarios [4]. They provide machine-processable semantics and are therefore used in different domains, such as software engineering and medical science, to automate the tasks involved. Ontologies are also used in text mining for social media analysis [5]. Several approaches employ ontologies to infer racial, religious, or political hate speech from the content shared on social media platforms in the form of posts and comments. The current paper is organized as a literature review of previous techniques, divided into the domains of text mining, sentiment analysis, and semantics, followed by a state-of-the-art comparison of these techniques. Lastly, the conclusion and future directions are presented.

II. LITERATURE REVIEW

Social networking platforms such as Facebook and Twitter are used to remove barriers among communities around the world and to share and communicate knowledge and thoughts [6]. Unfortunately, these platforms have also enabled people to commit insults, cyberbullying, and various incitements through racial, religious, and political hate speech that lead to social anarchy [7].
Hate speech as a term was used by Werner [8]; it is defined as an intentional act of insult using abusive or hostile messages. Several other terms, including cyberbullying, are used in the literature for the same purpose. Hate speech has negative consequences for society and is punishable by governments around the world under different laws [9], [10]. It is thus necessary for any government to detect and respond to this crime through surveillance. Here, we are concerned with surveillance involving textual data from various social media websites.

Several techniques are used for analyzing unstructured content on social media. Social media platforms are loaded with huge amounts of textual data. Most social media users communicate in unstructured form, without exact words or proper sentence structure. Although data mining techniques have been used to extract huge amounts of structured data, they do not work well with unstructured and dynamic textual data. Therefore, text mining and natural language processing techniques have become popular for analyzing unstructured textual data. Text mining is an extension of data mining, but it is more complex because it deals with natural language. It creates meaningful data from unstructured data patterns [2].

Text mining, or text analytics, can be divided into four stages. First, the data is captured from social media platforms and pre-processed using stemming, tokenization, and stop-word removal. The extracted and cleaned data is then represented through models to derive knowledge from it. For representing and modeling the data, the Bag of Words (BOW) technique is mostly employed, as described in [11]. It transforms the collected data into numeric vector-based representations to which algebraic operations can be applied. However, this technique falls short when it is used to analyze the relationships between isolated pieces of information. This semantic gap can only be filled by using a semantic knowledge base.

Another text mining approach [12] extracts intellectual data from Facebook. It uses the Facebook API to extract data about several attributes of users, including their age, profile, comments, and timeline posts, in order to transform them into different representations using data mining techniques. The aim is to study the users themselves and their activities and to create their individual profiles. This approach is also concerned with inter-user comparison of activities and with behavior prediction.

Yet another technique [13] uses Twitter tweets to identify and monitor posts related to medication abuse.
It identifies and organizes the tweets that mention these abuse-related medications and then divides them into three categories of drug or medication abuse. The authors also developed an automated supervised classification technique that distinguishes posts that signal the presence of drug abuse from those that do not. NLP and the WordNet semantic ontology are also employed to analyze the posts. For the automated supervised classification, several available algorithms, such as Support Vector Machines (SVM), are used. The classifiers built from these algorithms are then combined using stacking, so that the final decision is based on the individual predictions.

A semantic approach by Mika [14] discusses folksonomy as a dynamic type of ontology related to a particular community-based domain (social network). Folksonomy allows individual users to express shared concepts using their own choice of keywords. The author argues that the very basic and simple concept of ontology restricts folksonomy from being used across the domain, while problems may arise with the temporal extension of knowledge and the evolution of the social community as its members or their commitments change. This can invalidate the knowledge contained in that particular ontology. The author presented a tripartite model consisting of the actor, concept, and instance for the purpose of folksonomy. This idea was inspired by the social tagging mechanism, in which users apply different tags to communicate their concerns. In this way, the construction of such an ontology is focused on a particular social media context. For instance, Flickr is used to share photos, CiteULike provides tagging of scientific papers, and so on. Although the scope of each is limited to a single website, the idea of tagging can be reused in other similar applications.

Maynard et al. [15] presented a real-time semantic framework that employs linked open data as a knowledge base for semantic searches. They used the GATE-based open-source framework for searching and aggregation to analyze the textual content of Twitter tweets, highlighting them in a generic or domain-independent environment using specific keywords. They employed this framework in the domain of political science to study change in the political environment, and it led them to successfully predict the UK general elections. However, its main focus is not a specific domain; rather, it presents a more generic form of analysis. It is also limited to Twitter.

In [16], the authors used neural language models to detect hate speech in comments. They moved away from the BOW approach because it may not capture all hate speech if the offender changes the offensive words while the meaning remains clear. Therefore, they used the low-dimensional approach of CBOW (Continuous Bag of Words) to make the neural language model more effective and efficient in hate speech detection. In this approach, words are represented in a common vector space and the aim is to predict a central word or comment.

There is another technique known as sentiment analysis that attempts to detect the sentiment or emotions of users from the text in order to determine their attitude or opinion regarding a certain topic. This may be thought of as similar to text analysis. However, it is mostly used to determine whether an expression or sentence is positive, negative, or neutral. It is also used to rate a product or service on the basis of its reviews. One such approach used for analyzing social media content employed unsupervised lexicon-based classification [17]. The unsupervised classifier does not require any training. It produces both a positive and a negative response rating for a given expression, because both positive and negative emotions may be present in a text. The approach then makes a ternary prediction from the estimates of both positive and negative emotions: whichever value overwhelms the other becomes the answer in the final prediction. The authors demonstrated that their approach produces better results than the available supervised or machine learning solutions.

Another recent empirical study used semantic feature representations to better understand the context of user expression in order to identify the hate speech intent of the user [18]. Unlike the previous approaches, this approach used an external semantic knowledge base in its feature representations. The semantic features employed included Hatebase features and FrameNet features. Hatebase is a multilingual hate speech knowledge base available online. The approach used several feature vectors (H_x), such as hateful meaning, non-hateful meaning, offensiveness, and unambiguousness, and averaged such vectors to generate the knowledge base features of a given social media post. FrameNet is another linguistic knowledge base that provides meanings under different semantic frame categories. Each post is processed through a frame semantic parser and a vocabulary is formulated. This vocabulary is then used to create a vector for each post representing the count of each frame in that post. Although the FrameNet approach has improved hate speech detection, it is presently limited to the English language and is also partially affected by words with multiple connotations.

The state-of-the-art table of various data mining techniques is presented below in Table 1 and Table 2. We
TABLE 1
CATEGORIZATION OF TECHNIQUES USED FOR TEXT ANALYSIS IN SOCIAL MEDIA
Author | Year | Technique | Category | Main Idea | Social Network
Rahman | 2012 | Systematic Mining Model | Text Mining | User Attributes | Facebook
Djuric et al. | 2015 | Continuous Bag of Words (CBOW) | Text Mining | Low Dimensional Text Embeddings | Generic
Sarker et al. | 2016 | Automated Supervised Classification | Text Mining | Combining Classifiers | Twitter
Paltoglou et al. | 2012 | Unsupervised Lexicon Classification | Sentiment Analysis | Linguistics Functions Classifier | Twitter, MySpace, Digg
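
The ternary decision made by the unsupervised lexicon classification in the last row of Table 1 can be illustrated with a minimal Python sketch. The toy lexicon, its strength values, and the function names are hypothetical placeholders, not the resources or code of [17]. Taking the strongest matching word as the estimate for each polarity is also a simplifying assumption. The sketch only shows how separate positive and negative estimates can be compared to give a positive, negative, or neutral label without any training step.

```python
import re

# Hypothetical toy lexicon: word -> emotional strength (the sign gives polarity).
# A real system would load a curated emotion lexicon instead.
TOY_LEXICON = {
    "love": 3, "great": 2, "good": 1,
    "hate": -3, "awful": -2, "bad": -1,
}


def strength_scores(text, lexicon=TOY_LEXICON):
    """Return (positive, negative) strength estimates for a text.

    Assumption: each estimate is the strength of the strongest matching word.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    values = [lexicon[t] for t in tokens if t in lexicon]
    positive = max([v for v in values if v > 0], default=0)
    negative = max([-v for v in values if v < 0], default=0)
    return positive, negative


def ternary_prediction(text, lexicon=TOY_LEXICON):
    """Label a text by whichever strength estimate dominates the other."""
    positive, negative = strength_scores(text, lexicon)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"
```

Under these assumptions, a text containing both "love" and "bad" gets the estimates (3, 1) and is labelled positive, while a text with no lexicon words falls back to neutral.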
TABLE 2
COMPARISON OF TECHNIQUES W.R.T SELECTED PARAMETERS
Technique | Category | Search | Discovery | Prediction | Validation
To a
Systematic Mining To a certain
Text Mining certain Yes No
Model extent
extent
Continuous Bag of To a certain
Text Mining Yes Yes Yes
Words (CBOW) extent
Automated Supervised Classification | Text Mining | Yes | Yes | Yes | No
Unsupervised
Sentiment To a certain To a certain
Lexicon Yes Yes
Analysis extent extent
Classification
Folksonomy, tripartite semantic model | Semantics | Yes | Yes | No | Yes
GATE semantic To a certain
Semantics Yes Yes Yes
framework extent
Semantic features | Semantics | Yes | Yes | No | Yes
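
The frame-count representation behind the semantic features in the last row of Table 2 can be sketched in Python as follows. The trigger-word table stands in for a real frame semantic parser and is purely hypothetical, as are the illustrative frame labels and function names; none of this is the implementation of [18]. The sketch shows only the two steps described in the text: formulating a frame vocabulary from the posts, then building one count vector per post.

```python
from collections import Counter

# Hypothetical stand-in for a frame semantic parser: a few trigger words
# mapped to illustrative frame labels. A real parser would assign frames
# using the FrameNet knowledge base.
TOY_FRAME_TRIGGERS = {
    "attack": "Attack",
    "hit": "Attack",
    "hates": "Experiencer_focus",
    "fear": "Experiencer_focus",
    "said": "Statement",
}


def parse_frames(post):
    """Return the list of frame labels evoked by a post (toy parser)."""
    words = post.lower().split()
    return [TOY_FRAME_TRIGGERS[w] for w in words if w in TOY_FRAME_TRIGGERS]


def build_vocabulary(posts):
    """Collect every frame seen in the posts, in a fixed sorted order."""
    return sorted({frame for post in posts for frame in parse_frames(post)})


def frame_count_vector(post, vocabulary):
    """Count how often each vocabulary frame occurs in one post."""
    counts = Counter(parse_frames(post))
    return [counts[frame] for frame in vocabulary]


posts = [
    "they said they will attack and hit back",
    "she hates crowds",
]
vocabulary = build_vocabulary(posts)
vectors = [frame_count_vector(p, vocabulary) for p in posts]
```

With these toy inputs the vocabulary has three frames, and each post becomes a three-element count vector that a downstream classifier could consume alongside other features.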
[13] A. Sarker, et al., "Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter," Drug Safety, vol. 39, no. 3, pp. 231–240, 2016. https://doi.org/10.1007/s40264-015-0379-4
[14] P. Mika, "Ontologies are us: A unified model of social networks and semantics," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 5, no. 1, pp. 5–1, 2007. https://doi.org/10.1016/j.websem.2006.11.002
[15] D. Maynard, I. Roberts, M. A. Greenwood, D. Rout, K. Bontcheva, "A framework for real-time semantic social media analysis," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 44, pp. 75–88, 2017. https://doi.org/10.1016/j.wasman.2017.08.009
[16] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, "Hate speech detection with comment embeddings," In: Proceedings of the 24th International Conference on World Wide Web, ACM, New York, NY, USA, 2015, pp. 29–30. https://doi.org/10.1145/2740908.2742760
[17] G. Paltoglou, and M. Thelwall, "Twitter, MySpace, and Digg: unsupervised sentiment analysis in social media," ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 4, pp. 1–66, 2012. https://doi.org/10.1145/2337542.2337551
[18] Y. Senarath, and H. Purohit, "Evaluating Semantic feature representations to efficiently detect hate intent on social media," In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), Irvine, CA, USA, 2020, pp. 199–202. https://doi.org/10.1109/ICSC.2020.00041
[19] F. Sahito, A. Latif, W. Slany, "Weaving Twitter stream into Linked Data a proof of concept framework," In: 2011 7th International Conference on Emerging Technologies, Islamabad, Pakistan, 2011, pp. 1–6. https://doi.org/10.1109/ICET.2011.6048497