Bhavesh Pariyani, MTech Data Science, Nirma University, Ahmedabad, Gujarat 380060, India, 19mced09@nirmauni.ac.in
Krish Shah, MTech Data Science, Nirma University, Ahmedabad, Gujarat 380060, India, 19mced13@nirmauni.ac.in
Meet Shah, MTech Data Science, Nirma University, Ahmedabad, Gujarat 380060, India, 19mced14@nirmauni.ac.in
Tarjni Vyas, Assistant Professor, Nirma University, Ahmedabad, Gujarat 380060, India, tarjni.vyas@nirmauni.ac.in
Sheshang Degadwala, Associate Professor & Head of Department, Computer Engineering, Sigma Institute of Engineering, sheshang13@gmail.com
Abstract—Twitter's central goal is to enable everybody to create and share ideas and information, and to express their opinions and beliefs without barriers. Twitter's role is to serve the public conversation, which requires representation of a diverse range of viewpoints. Yet it does not permit promoting violence against, or directly attacking or threatening, others on the basis of race, ethnicity, national origin, caste, sexual orientation, age, disability, or serious disease. Hate speech can hurt a person or a community, so it is not appropriate to use hate speech. Now, due to the increase in social media usage, hate speech is very commonly used on these platforms, and it is not possible to identify hate speech manually. It is therefore essential to develop an automated hate speech detection model, and this research work shows different approaches of Natural Language Processing for classification of hate speech through machine learning algorithms.

Keywords—Logistic Regression; SVM; Tf-Idf; Random Forest; Hate Speech; Bag of Words

I. INTRODUCTION

Due to the increasing scale of social media, people are using social media platforms to post their views. Giving opinions which are harsh or rude to someone directly to their face is a difficult task, so people feel it is safe over the internet to abuse or post something offensive to others, and they feel secure posting such content on the internet. Due to this, the use of hate speech over social media is increasing daily. So, to handle such a large volume of user data over social media, automatic hate speech detection methods are required. In this paper we use machine learning methods to classify whether a tweet is hate speech or not [13].

There are a number of machine learning applications; one of them is text-based classification. Each instance, or here we can say each tweet, is represented using the same set of features used by the machine learning algorithms. There are two types of problems solved by machine learning algorithms: supervised and unsupervised. Supervised learning is the task of training a model on a given dataset containing both a set of features and labels, whereas unsupervised learning is the training setting in which the dataset is neither categorized nor labeled [3].

Supervised learning is further divided into two types, regression and classification, based on the labels of the dataset. Here we are concerned only with classification. Classification algorithms use a categorical dataset and are used to predict the class/category of an unknown instance. Various machine learning applications include tasks that can be set up as supervised. We aim to do this task by applying supervised classification methods like support vector machine, logistic regression and random forest on a labeled hate speech dataset. Each instance is represented in the form of a vector; the length of the vector depends on the method used for representation of tweets. In this paper we use two approaches for vector representation of a tweet: term frequency-inverse document frequency (tf-idf) and bag of words.

Fig. 1: Types of hate content removed from platforms that have signed the EU Code of Conduct (January 2018). [Pie chart: Anti Semitism 8.70%, Anti Migrant Hatred 17.80%, Specific Religion Hatred 17.70%, Ethnic Origin 15.80%, Race 8.70%, Gender 2.80%, Sexual Orientation 12.70%, Religion 4.50%, National Origin 9.10%, Other 2.20%.]
Authorized licensed use limited to: Carleton University. Downloaded on June 14,2021 at 23:17:06 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV 2021).
IEEE Xplore Part Number: CFP21ONG-ART; 978-0-7381-1183-4
A. Hate Speech

In old times, hate speech was limited to face-to-face conversations. But now, due to the increase in social media platforms, the usage of hate speech is increasing, as people feel they are hidden on the internet. Due to this, people feel safe using hate speech, and it is an infeasible task for humans to identify hate speech on social media, so we need some automated techniques to detect hate speech.

On the other side, individuals are more likely to share their views online, thereby leading to the dissemination of hate speech. Given that this sort of prejudiced communication may be particularly detrimental to society, policymakers and social networking sites may profit from monitoring and prevention tools [6].

Hate speech is generally described as any communication that disparages an individual or community on the basis of characteristics such as color, ethnicity, gender, sexual preference, nationality or religion [8].

B. Definition of hate speech

According to Paula Fortuna and Sérgio Nunes, "Hate speech is language that attacks or diminishes, that incites violence or hate against groups, based on specific characteristics such as physical appearance, religion, descent, national or ethnic origin, sexual orientation, gender identity or other, and it can occur with different linguistic styles, even in subtle forms or when humor is used" [6].

II. RELATED WORK

There are many approaches for detection of hate speech, but they differ from each other based on the output they obtain. In Ref. 8, hate speech was classified into three classes: race, nationality and religion. Ref. 8 uses a sentiment analysis technique for hate speech detection; not just detecting it, the authors also classify it into one of the three classes and rate the polarity of the speech.

We found two survey papers for automatic hate speech detection [6],[14]. In Ref. 6, the motivation for hate speech detection is shown, and why it became necessary to develop more robust and accurate models for automatic hate speech detection.

A problem in hate speech detection research is that researchers often keep data private while collecting it, and little open source code is available, which makes comparative study difficult [6]. This degrades progress in this field. Different features related to hate speech are described in Ref. 14, such as simple surface features, which include bag of words, unigrams or n-grams. Both the training set and the testing set need to share the same predictive words, which is a problem because hate speech detection is applied to very small pieces of text; to overcome this issue, word generalization is applied [14].

The knowledge of annotators for hate speech was examined in Ref. 15. The authors produce some very good results with amateur annotation in comparison to expert annotation. Waseem also provides his own dataset and its evaluation. To penalize misclassification on minority classes, the weighted F1-score is suggested as an evaluation measure.

Nowadays, with developments in deep learning, CNNs can be used for hate speech detection [2],[1]. Word vectors, also known as word embeddings, can be trained on a relevant corpus of the domain; these pretrained word vectors are used in a CNN [2]. Most machine learning models use bag-of-words, which fails to capture patterns and sequences. It can be understood by the example in Ref. 2: if a tweet ends by saying "if you know what I mean", no individual word can be considered hate speech, but it is most likely that the sentence is hate speech. This type of feature cannot be handled by bag of words, which degrades the performance of traditional machine learning algorithms.

III. DATASET DESCRIPTION

The dataset was obtained from the online social media platform Twitter. It can easily be found on GitHub, where previous researchers have uploaded different datasets for hate speech detection.

Table 1. Classwise distribution of dataset
Class                       No. of Instances
Class 0 (non hate speech)   29720
Class 1 (hate speech)       2242

From each class, 80% of the data is used for training and 20% for testing.

Table 2. Classwise distribution of training dataset
Class     No. of Instances
Class 0   23775
Class 1   1794

Table 3. Classwise distribution of testing dataset
Class     No. of Instances
Class 0   5945
Class 1   448

Table 4. Top 11 words with highest frequency in hate speech
Amp, Hate, Trump, Women, White, Might, Libtard, Allahsoil, Black, Sjw, Racist
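The 80/20 per-class split behind Tables 2 and 3 is a stratified split. A minimal pure-Python sketch (the helper name `stratified_split_indices`, the shuffling, and the seed are our illustrative assumptions; the paper does not state how the split was drawn):

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, test_frac=0.2, seed=42):
    """Return (train, test) index lists, drawing test_frac from each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # randomize within the class
        n_test = round(len(idx) * test_frac)   # 20% of this class goes to test
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# toy example: 100 "non hate" (class 0) and 20 "hate" (class 1) tweets
labels = [0] * 100 + [1] * 20
train, test = stratified_split_indices(labels)
print(len(train), len(test))  # 96 24
```

Because the split is taken within each class, the minority (hate speech) class keeps the same 80/20 proportion as the majority class.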
IV. FEATURE EXTRACTION

Feature extraction is a method to convert each tweet into a fixed set of attributes so that it can be easily interpreted by machine learning models. In feature extraction, a vocabulary of words is generated, and the vocabulary depends upon the method we use for feature extraction. This vocabulary is used to build the vector representation of each tweet.

TF: Term frequency, which measures how often a word appears in a document. Since documents differ in length, a word is likely to occur far more often in longer documents than in shorter ones, so the count is normalized by document length:

    TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)    (1)
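Equation (1) can be computed directly. A minimal Python sketch (the function name and the whitespace tokenization are illustrative assumptions, not from the paper):

```python
def term_frequency(term, document):
    """TF(t): count of term t in the document / total terms in the document (Eq. 1)."""
    tokens = document.lower().split()   # simple whitespace tokenization
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)

doc = "I am the trees the forest my home"
print(term_frequency("the", doc))  # "the" appears 2 times out of 8 tokens -> 0.25
```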
As a worked example, consider the following corpus:

corpus = ["I am the trees the forest my home",
          "I am the pollen and the broken leaves",
          "I am canopy",
          "I am the breeze",
          "I am here to please"]

Step 1: Calculate the term frequency. Calculating the term frequency is pretty straightforward: it is calculated as the number of times a word or term appears in a document.

We need a method to interpret textual data for the machine learning algorithm, and the bag-of-words model allows us to accomplish that target. The bag-of-words model is easy to understand and apply. In this method we create tokens of each sentence and then calculate the frequency of each token [12]. For example:

"i am thankful for having a paneer today"
"i get to see my daddy today!"

First step: create tokens of each sentence.
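The two bag-of-words steps (tokenize each sentence, then count token frequencies over a shared vocabulary) can be sketched in plain Python. Whitespace tokenization is an assumption here, so "today" and "today!" remain distinct tokens:

```python
from collections import Counter

sentences = ["i am thankful for having a paneer today",
             "i get to see my daddy today!"]

# Step 1: tokenize each sentence and count token frequencies
counts = [Counter(s.split()) for s in sentences]

# Step 2: build the shared vocabulary and a fixed-length vector per sentence
vocab = sorted(set().union(*counts))
vectors = [[c[w] for w in vocab] for c in counts]

print(vocab)
print(vectors)  # each row: frequency of every vocabulary word in that sentence
```

Every sentence is mapped to a vector of the same length (the vocabulary size), which is exactly the fixed set of attributes the classifiers need.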
A. Tokenizing

B. Stop Words

Stop words are words which carry little meaning or are useless, e.g. "a", "the", "of", which occur in most sentences. So it is required to remove these stop words, otherwise they will cause misclassification.

C. Stemming

Stemming is a process in which the prefix or suffix of a word is removed to reduce it to a common base form. For example, "processing", "process" and "processed" have basically the same meaning if we ignore the tense, so it is required to convert all these words into a similar form. For this, the well-known English stemmer, the Porter stemmer, is used here.

D. Case Folding

In case folding, all words are changed to lowercase. This is used to make the vocabulary as small as possible.

For logistic regression, the input feature vector is

    X = (X1, X2, ..., Xn)    (4)

Figure 2: Sigmoid function

VII. RESULTS

First, SVM, logistic regression and random forest are used with default parameters, with bag-of-words and tf-idf representations, without any preprocessing. The size of the feature vector in each representation, with and without preprocessing, is shown in the table below. We can see that with preprocessing the size of the vectors decreases, which decreases the computation time.

Table 15. RF with BOW (confusion matrix)
                         Predicted Not Hate Speech    Predicted Hate Speech
Actual Not Hate Speech   5928                         17
Actual Hate Speech       220                          228

2) Accuracy Score & F1-Score

Table 16. Accuracy score and F1 score
Model            F1-Score    Accuracy-Score
SVM with tfidf   0.4933      0.9524
SVM with BOW     0.5518      0.9560
LR with tfidf    0.2704      0.9401
LR with BOW      0.6538      0.9605
RF with tfidf    0.5475      0.9560
RF with BOW      0.6580      0.9629

B. Data with preprocessing and using GridSearchCV

1) GridSearchCV results
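GridSearchCV exhaustively scores every combination in a parameter grid and keeps the best-scoring one. A pure-Python sketch of that procedure (the grid values and the toy scoring function are illustrative assumptions, not the paper's actual settings; in practice the score would be a cross-validated F1):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)            # e.g. cross-validated F1 of the model
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# illustrative grid for an SVM; the toy score just favors C=1 with an rbf kernel
grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
toy_score = lambda p: (p["C"] == 1) + (p["kernel"] == "rbf") * 0.5
print(grid_search(grid, toy_score))  # ({'C': 1, 'kernel': 'rbf'}, 1.5)
```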
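The accuracy and F1 scores reported below follow directly from the confusion-matrix counts. As a quick check for the RF-with-BOW matrix (Table 23), with hate speech as the positive class; the corresponding row of Table 24 reports 0.6840 and 0.9640, matching up to rounding:

```python
# Confusion matrix for RF with BOW (Table 23):
# rows = actual, columns = predicted; hate speech is the positive class
tn, fp = 5914, 31    # actual non hate speech
fn, tp = 199, 249    # actual hate speech

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall

print(round(accuracy, 4), round(f1, 4))  # 0.964 0.6841
```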
Table 23. RF with BOW (confusion matrix)
                         Predicted Not Hate Speech    Predicted Hate Speech
Actual Not Hate Speech   5914                         31
Actual Hate Speech       199                          249

3) Accuracy Score & F1-Score

Table 24. Accuracy score and F1 score
Model            F1-Score    Accuracy-Score
SVM with tfidf   0.7488      0.9668
SVM with BOW     0.7101      0.9630
LR with tfidf    0.7327      0.9654
LR with BOW      0.7055      0.9629
RF with tfidf    0.5432      0.9529
RF with BOW      0.6840      0.9640

CONCLUSION

In routine life, as the usage of social media has increased, everyone seems to think they can speak or write anything they want. Due to this thinking, hate speech has increased, so it becomes necessary to automate the process of classifying hate speech data. To simplify this process, we have used a machine learning approach to detect hate speech in Twitter data. For this we have used the tf-idf and bag-of-words methods to extract features from the tweets. To classify hate speech in the tweets, we have implemented machine learning algorithms like SVM, logistic regression and random forest. We can conclude from the results obtained that, using data without preprocessing and machine learning models with default parameters, random forest with bag of words gives the best performance, with 0.6580 F1 score and 0.9629 accuracy score. But, as explained earlier, obtaining the highest accuracy alone is not enough when we are dealing with an imbalanced-class dataset; for that we have used the F1 score, which is quite low for data without preprocessing. To improve this, we have applied some preprocessing steps and grid search to obtain the best parameters for each machine learning model. After preprocessing and using grid search, SVM with tf-idf gives the best performance, with 0.7488 F1 score and 0.9668 accuracy score. The tf-idf feature extraction model achieves superior accuracy in comparison to the bag-of-words model because bag of words just counts the frequency of words and uses those counts as a vector, whereas the tf-idf model weights the term frequency by the inverse document frequency. A limitation of this approach is that it has only been applied to the Twitter dataset, so detecting hate speech at big-data scale can be a challenge.

In future, the F1 score and accuracy can be improved, and more machine learning techniques need to be explored. Also, different methods need to be applied to handle the imbalanced-class dataset.

REFERENCES

[1] Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760, 2017.
[2] Md Abul Bashar and Richi Nayak. QutNocturnal@HASOC'19: CNN for hate speech and offensive content identification in Hindi language. In Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation (December 2019), 2019.
[3] Pete Burnap and Matthew L. Williams. Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Policy & Internet, 7(2):223–242, 2015.
[4] Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. Hate me, hate me not: Hate speech detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), pages 86–95, 2017.
[5] Shimaa M. Abd El-Salam, Mohamed M. Ezz, Somaya Hashem, Wafaa Elakel, Rabab Salama, Hesham ElMakhzangy, and Mahmoud ElHefnawi. Performance of machine learning approaches on prediction of esophageal varices for Egyptian chronic hepatitis C patients. Informatics in Medicine Unlocked, 17:100267, 2019.
[6] Paula Fortuna and Sérgio Nunes. A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR), 51(4):1–30, 2018.
[7] Purnama Sari Br Ginting, Budhi Irawan, and Casi Setianingsih. Hate speech detection on Twitter using multinomial logistic regression classification method. In 2019 IEEE International Conference on Internet of Things and Intelligence System (IoTaIS), pages 105–111. IEEE, 2019.
[8] Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering, 10(4):215–230, 2015.
[9] Ammar Ismael Kadhim. Term weighting for feature extraction on Twitter: A comparison between BM25 and TF-IDF. In 2019 International Conference on Advanced Science and Engineering (ICOASE), pages 124–128. IEEE, 2019.
[10] Harpreet Kaur, Veenu Mangat, and Nidhi Krail. Dictionary-based sentiment analysis of Hinglish text and comparison with machine learning algorithms. International Journal of Metadata, Semantics and Ontologies, 12(2-3):90–102, 2017.
[11] M. Ramya and J. Alwin Pinakas. Different types of feature selection for text classification. International Journal of Computer Trends and Technology, 10(2):102–107, 2014.
[12] Joni Salminen, Maximilian Hopf, Shammur A. Chowdhury, Soon-gyo Jung, Hind Almerekhi, and Bernard J. Jansen. Developing an online hate classifier for multiple social media platforms. Human-centric Computing and Information Sciences, 10(1):1, 2020.
[13] TYSS Santosh and KVS Aravind. Hate speech detection in Hindi-English code-mixed social media text. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pages 310–313, 2019.
[14] Anna Schmidt and Michael Wiegand. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10, 2017.
[15] Zeerak Waseem. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142, 2016.
[16] Tingxi Wen and Zhongnan Zhang. Effective and extensible feature extraction method using genetic algorithm-based frequency-domain feature search for epileptic EEG multiclassification. Medicine, 96(19), 2017.
[17] S. Abro, Sarang Shaikh, Z. A. Khan, S. Mujtaba, G., and Z. H. Khand. Automatic hate speech detection using machine learning: A comparative study. Machine Learning, 10(6), 2020.