COLLEGE OF INFORMATICS
September, 2023
Keywords: Afaan Oromo, hate speech, multi-label, ensemble, machine learning
Table of Contents
Abstract..........................................................................................................................1
Chapter 1: Introduction..................................................................................................7
1.2. Motivation.......................................................................................................8
2.1. Introduction.......................................................................................................12
2.8.2. Bagging Ensemble......................................................................................32
3.4.5. Tokenization...............................................................................................46
3.4.6. Lowercase...................................................................................................46
3.4.7. Stemming....................................................................................................46
3.8.1. Inter Annotator Agreement (IAA)..............................................................48
3.8.4. Visualization...............................................................................................49
4.5.15. Experiment with LSTM...........................................................................76
4.6. Summary........................................................................................................82
5.1. Conclusions...................................................................................................84
List of Tables
Table of Figures
Chapter 1: Introduction
In everyday life, and especially on social media, the spread of hate speech is often accompanied by offensive language[3]. Offensive language is an utterance containing offensive words or phrases conveyed to the interlocutor (individuals or groups), whether verbally or in writing[2]. Hate speech that contains offensive words or phrases often accelerates social conflict because such words trigger emotions. Although offensive language is sometimes used only in jest (not to offend anyone), its use on social media can still lead to conflict through misunderstandings among citizens.
Hate speech and offensive language on social media must be detected to avoid conflicts between citizens. Social media platforms including Facebook, Instagram, and Twitter are striving to deploy artificial intelligence (AI) applications on their sites to block hate speech automatically and create safe environments for their users. Many academics have been studying the detection of hate speech in recent years. The majority of current studies
on the detection of hate speech focus on high-resource languages like English. Today, however, hate speech and insults directed at specific persons or groups are frequently expressed on social media in many languages. As a result, there is a need for systems that detect hate speech in languages with limited resources, such as Afaan Oromo. The authors of [4] and [5] proposed machine learning techniques, and those of [6] and [7] suggested deep learning methods, for detecting hate speech in Amharic social media texts. Some scholars have also researched hate speech detection in Afaan Oromo. For instance, the author of [8] conducted a comparison of Afaan Oromo hate speech detection using machine learning and deep learning, and the work of [9] introduces Afaan Oromo hate speech detection and classification on social media using several machine learning and deep learning models. There are also other existing works on hate speech detection using machine learning[10],[11] and deep learning[12],[13],[14]. Beyond single deep learning models, many researchers today follow an ensemble of deep learning models to detect hate speech[15],[16],[3]. Even though the ensemble approach performs best for hate speech detection, it has not yet been applied to Afaan Oromo. To fill this gap, we follow an ensemble deep learning approach in this study.
In this research, we investigate the identification of offensive language and hate speech on Afaan Oromo social media. Facebook, Twitter, and YouTube are the social media networks in Ethiopia most frequently used to promote hate speech and offensive language; hence we choose them as our data sources. This is a multi-label text classification problem in which texts may contain offensive language, free speech, political hate speech, or religious hate speech.
To conduct our research on offensive language and hate speech detection, a corpus of comments and posts was first retrieved from Facebook, Twitter, and YouTube. After that, features are extracted using word embedding techniques such as Word2Vec[17]. Machine learning models, such as support vector machine (SVM), logistic regression (LR), and random forest (RF), and deep learning models, including convolutional neural networks (CNN), long short-term memory networks (LSTM)[18], gated recurrent units (GRU), and bidirectional long short-term memory networks (Bi-LSTM)[19], are employed for hate speech detection. We then implement an
ensemble of deep learning models based on the individual classifiers. We hypothesize
that the combination of multiple classifiers will lead to more accurate performance
results.
1.2. Motivation
As social media users, we have observed that many people in Ethiopia now post hate speech on a regular basis, especially against particular ethnic groups. The issue has even challenged the government, which has attempted multiple times to control its effects by interrupting internet connectivity[20]. Despite their efforts to build
AI for hate speech detection, social media companies like Facebook and Twitter continue
to face a difficult problem: social media hate speech. A recent declaration for the
prohibition of hate speech from the Ethiopian government requires social media service
providers to delete hate speech content from their sites. This is due to the fact that any
hate speech on social media might result in significant conflict in actual society. The
motivation behind this study is to address the gap in hate speech detection for the Afaan
Oromo language. We want to increase the efficiency of hate speech detection models and
help create a safer online environment by creating an ensemble deep learning technique.
We believe that by leveraging the power of ensemble models, we can enhance the
accuracy and robustness of hate speech detection, enabling proactive interventions and
promoting inclusivity, tolerance, and respectful dialogue.
Despite the progress made in hate speech detection using deep learning techniques, the
performance of individual models is often limited by factors such as bias in training data,
noise in the input, and overfitting. To address these issues, researchers have explored the
use of ensemble deep learning techniques, which combine multiple models to improve
the overall accuracy and reliability of hate speech detection. The efficiency of ensemble deep learning approaches for the detection and categorization of Afaan Oromo hate speech, however, is still not well studied. We thus propose an ensemble deep learning-based hate speech detection system for Afaan Oromo to fill this gap.
2. What is the optimal architecture and configuration of an ensemble deep learning model
for Afaan Oromo hate speech detection?
3. How does the proposed ensemble deep learning approach compare to individual deep
learning models in terms of accuracy, robustness, and generalization capability for Afaan
Oromo hate speech detection?
1.6. Objectives
1.6.2. Specific Objectives
To achieve the aforementioned general objective, the following specific tasks have been performed:
To prepare a hate speech dataset from Afaan Oromo text on social media pages, to be used for training and testing purposes.
To implement annotation rules for labeling posts and comments.
To develop an ensemble deep learning-based hate speech detection model that
combines multiple models to improve the accuracy and robustness of hate speech
detection.
To compare the performance of the ensemble model with individual deep learning
models commonly used for hate speech detection.
To investigate the contribution of individual models in the ensemble and analyze
the impact of different ensemble techniques on hate speech detection
performance.
In this work, some machine learning models and deep learning model variants such as
CNN, Bi-LSTM, LSTM, and GRU are trained, and their output is combined to enhance
the performance of the models. The dataset is divided into five categories: free speech,
hate speech motivated by religion, hate speech motivated by politics, hate speech
motivated by race, and offensive language.
1.7.2. Limitation of the study
Due to the complexity of the data annotation task and our limited number of annotators, we were unable to create a large dataset. Another major issue concerned computational resources: having only a low-end GPU prohibited us from carrying out GPU-intensive processes such as automatic hyperparameter tuning. A lack of time and data also prevented us from creating models for detecting hate speech in non-textual data, such as audio and video.
Chapter 2: Literature Review and Related Work
To further understand the topic and explore the research challenge, this chapter reviews pertinent literature. It includes a definition of hate speech, applications of hate speech detection systems, and current methods for identifying hate speech. Definitions
of hate speech and social media are provided first. The chapter then reviews hate speech
on social media, provides an introduction to the Afaan Oromo, discusses hate speech
detection methods, and reviews relevant literature on social media hate speech detection.
Definition 2: Facebook defines hate speech as "objectionable content that directly attacks people based on what we call protected characteristics, such as race, ethnicity, national origin, religious affiliation, sexual orientation, caste, sex, gender, and gender identity. Additionally, we offer various safeguards based on immigrant status." Facebook characterizes an attack as "violent or dehumanizing speech, statements of inferiority, or calls for exclusion or segregation."
Definition 3: According to the European Court of Human Rights[23], hate speech shall be deemed to include "all expressions which spread, incite, promote, or otherwise justify racial hatred, xenophobia, anti-Semitism, or other forms of hatred based on intolerance, including intolerance expressed through aggressive nationalism and ethnocentrism, discrimination against minorities, migrants, and people of immigrant origin".
sharing of their thoughts and opinions. However, there are also a number of frightening
repercussions, such as online harassment, trolling, cyberbullying, and hate speech.
In Ethiopia, hate speech on social media has become a major problem as the number of social media users in the country has grown. The lack of legal provisions that directly define or tackle hate speech makes the problem more difficult. There is, however, a law used indirectly for hate-related issues: the anti-terrorism law, which forbids "the use of any telecommunication network or apparatus to broadcast any terrorizing information" or "obscene message", including the use of any social media or other communication platform to disseminate terrorizing messages. Violations carry a prison sentence of up to eight years. However, the law has also been used to limit messages or speeches that criticize government policy and officials. It has drawn criticism from national and international organizations and the academic community because its use contradicts the freedom of speech guaranteed by human rights law. Lawmakers, government officials, and politicians in Ethiopia have now become aware of hate speech on social media, and on March 23, 2020, lawmakers enacted a new hate speech and fake news law, the proclamation cited as "Hate Speech and Disinformation Suppression and Prevention Proclamation No. 1185/2020"[24].
NLP is a collection of algorithms and methods used to extract meaning and grammatical structure from input natural language in order to perform useful tasks such as natural language generation, which builds output according to the rules of the target language,
which is spoken by people from a specific region, for a specific task [6]. Due to its ability to foster engagement and productivity, NLP is valuable in fields such as database interfaces, duplication detection, computer-supported teaching, and tutoring systems [25]. NLP techniques are developed so that commands given in natural language can be understood by the computer, which can then act on them. Natural language processing can be divided into two parts, namely written and spoken language. The 'levels of language' technique is the most illustrative way to explain what happens within a natural language processing system [27]. People use these levels to decode spoken or written language and determine its meaning, because language processing primarily relies on formal models or representations of knowledge at the various levels. By utilizing this linguistic knowledge, language processing applications distinguish themselves from ordinary data processing systems. Natural language processing analysis has the following levels: phonology, morphology, lexical, syntactic, semantic, discourse, and pragmatic. Afaan Oromo is one of the natural languages on which various NLP research can be conducted.
differences and inconsistencies among scholars and researchers regarding the number and categorization of the dialects[32]. These are Western (Wellega, Iluababor, Kaffa, and parts of Gojjam), Eastern (Harar, eastern Showa, and parts of Arsi and Bale), Central (central Showa, western Showa, and possibly Wollo), and Southern (parts of Arsi, Sidamo, and Borena).
Table 1: The Qubee international phonetic writing[29]
The way adjectives are formed in Afaan Oromo and English is another distinction. In
Afaan Oromo, adjectives are typically used after the noun or pronoun they modify. They
are also commonly used in close proximity to the noun. On the other hand, in English,
adjectives are frequently placed before nouns. For instance, in the phrase "ilma gaarii" (nice boy), the adjective "gaarii" follows the noun "ilma"[34].
Apart from apostrophes, the punctuation marks utilized in Afaan Oromo and English
serve the same purposes and are similar. In Afaan Oromo, the apostrophe mark (’) is
employed in writing to represent a specific sound called "hudhaa," whereas, in English, it
is used to indicate possession. The apostrophe mark (’) in Afaan Oromo plays a
significant role in both reading and writing systems. Punctuation marks are employed in
the text to enhance clarity of meaning and facilitate reading. Afaan Oromo follows the
punctuation patterns observed in English and other languages that adopt the Latin writing
system.
understand the data and help users in decision-making in their day-to-day activities.
Machine learning is commonly divided into supervised and unsupervised learning, along with reinforcement and semi-supervised learning.
Supervised learning: This kind of machine learning (ML) learns from labeled data. It makes use of a labeled dataset with matched pairs of observed inputs (Xs) and the corresponding outputs (Ys). A machine learning algorithm is applied to the dataset to infer patterns between inputs and outputs.
Unsupervised learning: In the unsupervised setting, rather than using labeled data, we let the model run on its own to uncover structure. It is employed to train a model on a dataset with only inputs, analyzing the data to give it meaning and to organize or categorize it. By imposing some structure on the data, it leverages unlabeled and uncategorized data to provide recommendations to users. It draws conclusions from input data without output data; because it lacks labeled outputs, its objective is to infer the natural organization that exists within a collection of data points.
The concept of learning from positive and/or negative feedback underlies the category of reinforcement learning. It seeks to train a model to identify the most advantageous series of choices (policy) for resolving a certain problem: successful decisions are rewarded, while unsuccessful ones are penalized. In addition to these three categories, many well-known algorithms use semi-supervised learning, which, as its name implies, combines supervised and unsupervised learning: an ML model is trained using both labeled and unlabeled examples. Bag-of-words, word, and character n-gram features are the most successful surface features in machine learning approaches for classifying hate speech. As for classifiers, SVM, random forests, decision trees, logistic regression, and Naive Bayes are the most often employed algorithms. Below is a discussion of the most popular machine learning techniques for detecting hate speech.
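As a small illustration of the surface features mentioned above, character n-grams can be extracted from a token with a short helper. This is a minimal sketch; the function name and the example word are our own:

```python
def char_ngrams(text, n=3):
    """Return all contiguous character n-grams of a string.

    Character n-grams are robust surface features for morphologically
    rich languages, since they capture sub-word patterns.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Example with an Afaan Oromo word ("jibba", hate):
print(char_ngrams("jibba", 3))  # ['jib', 'ibb', 'bba']
```

A bag of such n-grams (or of whole words) can then be fed to any of the classifiers listed above.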
for classification and applies Bayes' theorem to the computation. The Naive Bayes algorithm is an effective method for classification problems and is well suited to real-time prediction, multi-class prediction, recommendation systems, text classification, and sentiment analysis use cases[35].
one in which each node represents a test on an attribute, each branch indicates the test's result, and each leaf node serves as a class label. In this method, the class or value of the target variable is predicted from the training data using simple rules. In more detail, classification of a record begins at the decision tree's root, comparing the record's attributes to each node's attribute along the branches before a final class label is predicted at a leaf node.
Figure 1: An illustration of the position of deep learning (DL) compared with machine learning (ML) and artificial intelligence (AI)[36]
Deep learning also refers to data-driven learning techniques that use multi-layer neural
networks for computing and processing. In the deep learning approach, the word "Deep"
alludes to the idea of numerous levels or stages of data processing before a data-driven
model is created. Deep learning is a machine learning technique that, in general, enables
computers to learn and perform tasks that people would naturally be able to perform.
They can be taught in supervised, unsupervised, or semi-supervised methods, just like
machine learning algorithms.
Deep learning techniques utilize deep artificial neural networks to classify hate speech by learning abstract feature representations from incoming data through their numerous stacked layers[37]. The input might be either the raw text data itself or any of the feature encoding formats used in traditional approaches. The main distinction is that in such a model, the input features need not be employed directly for categorization. Instead, new abstract feature representations that are more effective for learning may be derived from the input using the multi-layer structure. Because of the careful design of the network structure, deep learning-based approaches shift attention from manually designing features to automatically extracting valuable features from a basic input feature representation. Indeed, there is a significant movement in the
literature towards the use of deep learning-based techniques, and studies have
demonstrated that they outperform conventional techniques on this task. Nowadays, deep learning models excel at text analytics, and at hate speech detection in particular, because of their reliance on deep neural network classifiers. Such a model learns to recognize patterns in the given text and attempts to replicate them across layers of neurons. The effectiveness of deep learning models depends on the chosen hyperparameters and neural network algorithm, as well as on the feature representation methods.
Figure 2: Architecture of a deep learning neural network[7]
Due to their great accuracy, deep learning algorithms have recently received a lot of
attention in the text classification problem. For the classification of texts, the following
deep learning methods are employed:
Given the input x_t and the previous hidden state h_{t-1}, the new hidden state and the output at time step t are computed as:

h_t = σ_h(W_h x_t + U_h h_{t-1} + b_h)
y_t = σ_y(W_y h_t + b_y)

where x_t is the input vector at time step t, h_t is the hidden layer vector, and y_t is the output vector at time step t. W, U, and b are parameter matrices and vectors, and σ_h, σ_y are activation functions.
In an RNN, the hidden layers are recurrent layers, in which every neuron in the hidden layer is connected. The hidden layer takes input from both the input layer x_t and the hidden state from the previous step, h_{t-1}. Recurrent neural networks are intended for modeling sequences and are capable of remembering prior information, in contrast to feed-forward neural networks, which receive fixed-size vectors as input and produce fixed-size vector outputs.
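The recurrence can be written out directly in plain Python. This is a toy illustration of the equations with tanh as σ_h and an identity output activation, not a trainable implementation; all names and numbers are our own:

```python
import math

def rnn_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    """One recurrent step: h_t = tanh(W_h x_t + U_h h_{t-1} + b_h),
    y_t = W_y h_t + b_y (identity output activation for simplicity)."""
    n_h = len(b_h)
    h_t = [math.tanh(sum(W_h[i][j] * x for j, x in enumerate(x_t)) +
                     sum(U_h[i][k] * h for k, h in enumerate(h_prev)) +
                     b_h[i])
           for i in range(n_h)]
    y_t = [sum(W_y[i][k] * h for k, h in enumerate(h_t)) + b_y[i]
           for i in range(len(b_y))]
    return h_t, y_t

# A 1-dimensional example: with all-zero weights the state stays at zero.
h, y = rnn_step([1.0], [0.5], W_h=[[0.0]], U_h=[[0.0]], b_h=[0.0],
                W_y=[[1.0]], b_y=[0.0])
```

Feeding a sequence means calling `rnn_step` once per token, passing the returned hidden state into the next call.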
However, short-term memory is a challenge for recurrent neural networks (RNNs). They
will struggle to transfer the information from earlier time steps to later ones if the input
sequence is sufficiently lengthy. This issue was resolved by LSTM and GRU.
methods for back-propagation in the bidirectional LSTM, from the front and from the back, respectively. Because of this procedure, the Bi-LSTM is an effective tool for analyzing textual data[38]. The Bi-LSTM architecture is depicted as follows:
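The two-directional pass can be illustrated with a generic recurrence standing in for the LSTM cell. This is a conceptual sketch only; the toy step function below is our own and is not an actual LSTM:

```python
def bidirectional_pass(seq, step, h0):
    """Run a recurrence left-to-right and right-to-left and pair up the
    hidden states per time step, as a Bi-LSTM concatenates them."""
    forward, h = [], h0
    for x in seq:
        h = step(x, h)
        forward.append(h)
    backward, h = [], h0
    for x in reversed(seq):
        h = step(x, h)
        backward.append(h)
    backward.reverse()
    return list(zip(forward, backward))

# Toy step: accumulate the inputs (a stand-in for an LSTM cell).
states = bidirectional_pass([1, 2, 3], step=lambda x, h: h + x, h0=0)
# The first time step sees little left context but the full right context.
```

In a real Bi-LSTM, each paired state gives the classifier information about both the words before and the words after the current position.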
2.7.6. Convolutional Neural Network (CNN)
A popular discriminative deep learning architecture, the convolutional neural network (CNN) learns directly from the input without requiring manual feature extraction. CNNs thus improve on the construction of conventional regularized MLP networks that resemble ANNs. Each layer of a CNN reduces model complexity while considering the optimal parameters for a meaningful output. Additionally, CNNs employ "dropout" to address the over-fitting that may arise in a conventional network.
Since CNNs are specifically designed to handle a variety of 2D shapes, they are commonly used in visual recognition, medical image analysis, image segmentation, natural language processing, and many other applications. Because a CNN can automatically detect important components of the input without human involvement, it is more effective than a standard network. Depending on their learning capacities, the many CNN variants, such as the visual geometry group network (VGG), AlexNet, Xception, Inception, and ResNet, may be applied in a variety of fields[36]. Using word2vec pre-trained vectors as the primary vector representation of words and character n-grams, the CNN, a standard multilayer neural network, has proven a promising solution for hate speech identification in social media datasets. CNNs employ pooling to reduce computational complexity, shrinking the output passed from one layer to the next while preserving its key elements; various pooling strategies can help reduce the outputs. CNNs have been successfully utilized for hate speech identification despite being designed for image processing [6]. In general, the CNN is a crucial technique for classifying sequential, text, and string data.
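The convolution-plus-pooling idea for text can be sketched with a single one-dimensional filter sliding over a (here scalar) embedding sequence. The numbers and function names are illustrative only:

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution: slide the filter over the sequence of
    (scalar) token embeddings and take dot products."""
    k = len(kernel)
    return [sum(kernel[j] * seq[i + j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

def global_max_pool(xs):
    """Keep only the strongest filter response, as CNN text models do."""
    return max(xs)

# A difference filter that fires where token values rise sharply:
feature = global_max_pool(relu(conv1d([1.0, 2.0, 3.0, 0.0, 1.0],
                                      [1.0, -1.0])))  # 3.0
```

In a real model, hundreds of such filters of different widths run over dense word vectors, and their pooled responses form the feature vector passed to the classifier layer.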
The process of obtaining useful and distinctive qualities, known as features, from input data (in this case, text) is referred to as feature extraction. These features are gathered to improve how well deep learning systems perform on learning and classification tasks. Following extraction, a subset of the features often holds the most useful information[6].
Term frequency (TF): Given a document d, the term frequency is the number of times a given term t appears in that document. Intuitively, the more often a term occurs in the text, the more relevant it becomes.
Inverse document frequency (IDF): The primary function of the IDF is to evaluate a word's relevance, the main goal being to find the records that are actually pertinent to a query. Since TF accords all words equal importance, term frequencies alone cannot reliably gauge a term's weight in a document.
IDF(t) = log_e(Total number of documents / Number of documents containing term t)
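The TF and IDF definitions translate directly into code. Below is a minimal sketch over tokenized documents; the example tokens are our own:

```python
import math

def tf(term, doc):
    """Raw term frequency: how often the term appears in the document."""
    return doc.count(term)

def idf(term, documents):
    """IDF(t) = log_e(N / number of documents containing t)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

def tfidf(term, doc, documents):
    return tf(term, doc) * idf(term, documents)

docs = [["haasaa", "jibbaa"], ["haasaa"], ["nagaa"], ["nagaa", "haasaa"]]
score = tfidf("jibbaa", docs[0], docs)  # 1 * ln(4/1)
```

A term that occurs in few documents ("jibbaa" here, in one of four) gets a high IDF, while a term spread across most documents gets a weight near zero.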
Figure 4: An illustration of the continuous bag-of-words (CBOW) and Skip-gram word embedding models
GloVe, or Global Vectors for Word Representation, is a model that uses word co-occurrences to create a geometric encoding. With GloVe it is feasible to obtain word embeddings from the dataset that capture the semantic connections between words[6]. Various levels of semantic granularity are captured by the various dimensions (25, 50, 100, 200, and 300). The resulting features, referred to as GloVe features, are obtained by summing the GloVe vectors for each distinct phrase in a text, per dimension.
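The Skip-gram idea of Figure 4, predicting context words from a center word, starts from (center, context) training pairs, which can be generated as follows. This is a sketch of the pair-generation step only, not of the embedding training itself; the example tokens are our own:

```python
def skipgram_pairs(tokens, window=2):
    """List every (center, context) pair within the given window;
    these are the training examples the Skip-gram model learns from."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["ani", "nagaa", "jaalladha"], window=1)
# Each word is paired with its immediate neighbours.
```

CBOW simply reverses the direction of prediction, using the same windows: the context words jointly predict the center word.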
Ensembles of deep learning methods have emerged as a powerful approach for hate
speech detection, leveraging the strengths of multiple models to improve performance
and robustness. Hate speech detection is a challenging task that aims to automatically
identify and classify text or speech that contains offensive, discriminatory, or harmful
content. Deep learning techniques, such as neural networks, have shown promising
results in this area due to their ability to capture complex patterns and dependencies in
textual data.
In the context of hate speech detection, ensembles of deep learning methods can be
constructed using various architectures such as recurrent neural networks (RNNs),
convolutional neural networks (CNNs), and transformers. Each individual model within
the ensemble may have different architectural variations, hyperparameters, or pre-trained embeddings, ensuring a diverse set of classifiers.
Ensemble methods offer several advantages for hate speech detection. Firstly, they can
enhance the generalization ability of the system by reducing overfitting and capturing
different aspects of hate speech. Each model within the ensemble may focus on different
linguistic features or contextual cues, increasing the overall coverage and robustness of
the system. Secondly, ensembles can help mitigate the bias present in individual models.
Hate speech detection is a complex task that can be influenced by various factors,
including cultural, social, and linguistic biases. By combining the predictions of multiple
models, ensembles can reduce the impact of biased decisions made by individual
classifiers and provide a more balanced and fair classification outcome.
Furthermore, ensembles can improve the overall performance metrics of hate speech
detection systems. By leveraging the strengths of multiple models, ensembles can achieve
higher accuracy, precision, recall, and F1-score compared to individual classifiers. This is
particularly important in real-world applications where the consequences of
misclassifying hate speech can be severe. In this section, we give an overview of some of the most popular ensemble algorithms.
2.9.1. Averaging Ensemble
The simplest method for combining the predictions of multiple models is the averaging method[43]. It is a widely used ensemble technique in which each model is trained separately and the averaging step linearly integrates the models' predictions, averaging them to produce the final prediction. This technique is simple to apply and needs no extra training over large numbers of individual predictions. Usually, voting is the standard way to average the predictions of the baseline classifiers: the final prediction is determined by a majority vote on the predictions of the classifiers, which is referred to as hard voting.
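Hard voting on labels and soft voting on averaged class probabilities can each be sketched in a few lines. The label names and probability values below are hypothetical, purely for illustration:

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over the classifiers' predicted labels."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(prob_rows):
    """Average the per-class probabilities of each classifier,
    then pick the class with the highest mean (soft voting)."""
    n_models, n_classes = len(prob_rows), len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / n_models
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

winner = hard_vote(["hate", "free", "hate"])             # "hate"
cls = soft_vote([[0.6, 0.4], [0.2, 0.8], [0.45, 0.55]])  # class 1
```

Soft voting can overturn a hard-vote majority when the minority classifiers are much more confident, which is why it often performs slightly better in practice.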
training set. The predictions generated by each individual model are then aggregated to
form the Level-1 training set. It is important to note that, to prevent overfitting the meta-
learner, the data samples used for training the baseline classifiers must be excluded when
training the meta-learner. Therefore, the dataset needs to be split into two distinct parts.
The first part is used to train the base-level classifiers, while the second part is used to
construct the meta-dataset.
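The construction of the Level-1 (meta) dataset described above can be sketched with stand-in base models, simple rule functions taking the place of trained classifiers. All names and the toy data are illustrative only:

```python
def build_meta_dataset(base_models, X_holdout, y_holdout):
    """Level-1 data for stacking: each row holds the base models'
    predictions for one held-out sample; the labels are unchanged.
    The held-out split must not have been used to train the bases."""
    X_level1 = [[model(x) for model in base_models] for x in X_holdout]
    return X_level1, list(y_holdout)

# Stand-ins for trained base classifiers (0 = not hate, 1 = hate):
bases = [lambda x: int(x > 3), lambda x: int(x % 2 == 0)]
X_meta, y_meta = build_meta_dataset(bases, X_holdout=[2, 5], y_holdout=[0, 1])
# X_meta == [[0, 1], [1, 0]]; the meta-learner trains on these rows.
```

The meta-learner then fits on `X_meta` and `y_meta`, learning how much to trust each base model's prediction.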
Some research has used ML and DL models to detect hate speech. For example, the study in [4] argues that developing hate speech identification for Afaan Oromo social media is critical to eliminating the risk of hate speech to social welfare. The authors ran six experiments using ML approaches such as support vector machine (SVM), multinomial Naive Bayes (MNB), linear support vector machine (LSVM), logistic regression (LR), and random forest (RF) classifiers to build hate speech detection prototypes for Facebook and Twitter.
Despite the fact that they constructed the Afaan Oromo hate speech detection model
using ML methods and data from the Facebook and Twitter networks, the study only
looked at posts and comments in textual documents. Posts and comments including
images or photos, audio, or video data have not been evaluated. To evaluate performance, the researchers used metrics such as accuracy, precision, recall, and F1-score. Bigram and term frequency-inverse document frequency (TF-IDF)
ML feature selection techniques were utilized. The results show that LSVM has the best
performance. As a result, the researchers agreed to implement the Afaan Oromo hate
speech detection model using LSVM. The most significant limitation of this work is its use of traditional ML algorithms that require manual labeling of the dataset, and the data experiments were small in scale. According to those researchers, going beyond traditional ML methodologies could be the next study. As a result, we intend to use deep learning algorithms in this study.
Another author[9] created a number of models to detect and categorize Afaan Oromo hate speech on social media by combining several machine learning algorithms with feature extraction techniques such as BOW, TF-IDF, word2vec, and Keras embedding layers. They collected a total of 12,812 posts and comments from Facebook, focusing on four thematic categories of hate speech: gender, religion, race, and offensive speech. The author generalizes the work as follows: Bi-LSTM with pre-trained word2vec feature extraction is the superior approach, with accuracy scores of 0.84 and 0.88 for eight and two classes, respectively.
The author of [8] conducted a comparison of deep learning-based Afaan Oromo hate
speech detection methods. They begin by extracting a corpus of Facebook and Twitter
comments and posts using word n-grams and word embedding methods such as
Word2Vec. There were 35,200 posts and comments in all. The entire dataset size was
expanded to 42,100 by using the data augmentation approach. Convolutional neural
networks (CNN), long short-term memory networks (LSTM), bidirectional long short-term memory networks (Bi-LSTM), GRU, and CNN-LSTM are then employed for hate
speech detection. The experiment results show that the model developed with CNN and
Bi-LSTM has the greatest weighted average F1-score of 87%. The author suggests that
future research look into the performance of classifier ensembles and meta-learning for
this task. Furthermore, the performance is not perfect, which means that users may
encounter misclassified content. As a result, we used classifier ensembles and meta-
learning to boost performance.
The author of [40] developed an RNN-based algorithm for automatically detecting
hateful Amharic posts and comments. The author used a dataset of 30,000 items, with
80% for training, 10% for validation, and 10% for testing, and employed two deep
learning algorithms (LSTM and GRU) with word n-gram and word2vec features. The
LSTM algorithm with word2vec feature extraction obtained the greatest accuracy,
97.9%. However, the work performs only binary classification, which is insufficient, and
employs only a few algorithms (LSTM and GRU).
A study introduced in [5] developed an Apache Spark-based model to classify Amharic-
language Facebook posts and comments into hate and not hate. They employed Random
Forest and Naive Bayes as learning algorithms and Word2Vec and TF-IDF for feature
extraction, using 6,120 items (4,882 for training and 1,238 for testing). In their
experiments, Naive Bayes and Random Forest achieved accuracies of 79.83% and
65.34%, respectively, with the word2vec feature vector modeling approach. They
recommended expanding the classification categories with different aspects of hate and
increasing the corpus size with other sources.
The authors of [46] studied online hate campaigns on an Italian social network, utilizing
data collected from public Facebook pages in Italy. The collected dataset was categorized
into three classes: "no hate," "weak hate," and "strong hate." To create a second dataset,
the "weak hate" and "strong hate" classes were merged into a single "hate" class. The
authors developed and implemented two classifiers, SVM and LSTM, specifically
designed for the Italian language, incorporating morpho-syntactical features, sentiment
polarity, and word embedding lexicons. Two separate experiments were conducted using
both datasets, ensuring at least 70% agreement among annotators on the data's
classification. The SVM classifier achieved an F1-score of 80% for binary classification,
while the LSTM achieved a slightly lower F1-score of 79%. For ternary classification,
SVM and LSTM achieved F1-scores of 64% and 60%, respectively. The study utilized
only conventional SVM and LSTM models; employing additional deep learning models
could potentially enhance classification performance.
The authors of [47] proposed cyber hate speech detection for the Arabic context on the
Twitter platform, applying NLP and machine learning techniques. The work focused on a
set of tweets related to sports, racism, terrorism, journalism, sports orientation, and
Islam. The processed dataset was used in experiments with Decision Tree (DT), Support
Vector Machine (SVM), Naive Bayes (NB), and Random Forest (RF). In their
experiments, RF with TF-IDF and profile-related features achieved the best result, with
an accuracy of 91.3%. Since hate as a term is subjective and can be expressed in a wide
range of areas, not restricted to sports, religious, or racial issues, they recommended
further work on a more generalized dataset and effective detection models.
Likewise, a deep learning approach for automatic cyber hate speech detection on Twitter
is presented in [47]. The dataset was collected from Twitter and captures different hate
expressions in the Arabic region. The authors used word embedding mechanisms for
feature extraction, and a hybrid of CNN and LSTM networks was implemented for
model development. The proposed approach, which classifies tweets as hate or normal,
achieved promising results: 66.564%, 79.768%, 65.094%, and 71.688% for accuracy,
recall, precision, and F1 measure, respectively. The study is limited to binary
classification, whereas it is also important to distinguish offensive expressions from hate
speech; the study further recommended a more standardized dataset and higher-
performance deep learning approaches.
Gambäck et al. [48] present a deep learning-based hate speech classification system for
Twitter using a dataset prepared by Benikova et al. [69] with four class categories:
sexism, racism, both (sexism and racism), and not hate speech. They used four feature
embeddings, namely word2vec, character n-grams, random vectors, and character
n-grams combined with word vectors, together with a deep learning CNN. The models
were tested using 10-fold cross-validation; their best-performing model used word2vec
embeddings, yielding the highest precision and recall with a 78.3% F-score.
Several studies have demonstrated the effectiveness of ensemble deep learning in hate
speech detection. For instance, Smith et al. (2020) developed an ensemble model by
combining the predictions of CNNs, LSTMs, and BERT models, achieving superior
performance compared to individual models. The ensemble model effectively captured
both local and contextual features, enhancing its ability to detect subtle nuances of hate
speech.
Furthermore, Gupta and Rajput (2021) proposed a stacked ensemble model that
combined predictions from multiple CNNs and LSTMs. The model leveraged the
complementary strengths of the base models, leading to improved detection accuracy and
robustness across diverse hate speech datasets.
Table 3: Summary of related work

(continued) ... not for multi-purpose. The performance of the model is not good enough;
the author did only binary classification.

[40] Approach: deep learning (LSTM and GRU); Features: n-gram and word2vec;
Dataset: 30,000 items; Language: Amharic; Contribution: proposed a hate speech
detection model for the Amharic language; Limitation: limited to hate speech in the
Amharic language and to binary classification; only LSTM and GRU models were used.

[48] Approach: deep learning (CNN); Features: word2vec; Dataset: 6,655 items in total;
Language: English; Contribution: applied CNN to the problem of multi-class
classification (racism, sexism, both racism and sexism, and neither); Limitation: the
study used a small dataset of 6,655 items, which was biased and needed class balancing.

[20] Approach: machine learning (Random Forest and Naïve Bayes); Features:
Word2Vec and TF-IDF; Dataset: 6,120 posts; Language: Amharic; Contribution:
proposed applying Apache Spark to hate speech detection to reduce the challenges;
Limitation: limited to hate speech detection in Amharic.

[2] Approach: machine learning (RFDT, NB, SVM); Features: word n-gram, character
n-gram, orthography, and lexicon; Dataset: 16,500 posts; Language: Indonesian;
Contribution: proposed a model that detects the target, category, and level of hate
speech; Limitation: only conventional machine learning algorithms were applied.
Chapter 3: Methodology
3.1 Introduction
This chapter describes the methodologies used to accomplish this research, including
methods to implement the models, the literature review, data collection, data
preparation, the software and hardware configuration of the system used, and the
techniques used to evaluate the models.
Diagram
3.3. Methodology
Related literature must be reviewed to understand hate speech detection systems. To
obtain resources, we performed the following tasks. We reviewed relevant local and
international journal articles, conference papers, books, and resources on the Internet
related to hate speech detection based on textual data as well as machine learning
techniques to gain a conceptual understanding and identify research gaps in the study.
The Afaan Oromo text dataset for hate speech detection, which is the main focus of the
analysis in this paper, was retrieved from comments and posts published on Facebook,
Twitter, and YouTube from January 2023 to May 2023.
This work targets Facebook pages, Twitter accounts, and YouTube channels that are
open to suspected hate speech, rather than websites or blogs that already have specific
agendas. In Ethiopia, it is common for social network communities to post on political
and religious issues. While many users use different languages to create or share
information, only Afaan Oromo data is considered here: all posts and comments
collected from the different pages had to be in Afaan Oromo. In the data collection
process, posts and comments were collected using Facepager, an application built on the
Facebook API that can download posts and comments from a desired page in CSV
format. The dataset covers the following hate speech categories:
Free speech
Race
Religious
Politics
Offensive
3.2. Data source
The data sources for building the Afaan Oromo hate speech and offensive language
dataset are Afaan Oromo social media pages with many followers. Pages were randomly
selected from among those with higher numbers of followers, because a page with a
higher number of followers is believed to attract different users with different points of
view, as well as from well-known Afaan Oromo media. Several sampling metrics or
criteria can be used to select pages or users on social media platforms. This study chose
the following criteria for selecting a public page:
A page that posts news or comments daily on issues of religion, ethnicity, politics, and
gender, or that calls for violence.
The number of likes and followers of a page must be greater than 10,000, which selects
for more active public pages.
A page that uses the Afaan Oromo language most frequently for posts and comments.
Depending on the above page selection criteria, we collected data from different public
pages.
The summary of pages that were utilized to build the corpus is provided in Table 1.
Those pages listed in Table 1 typically post discussions on political, social, economic,
religious, and environmental issues that took place in Ethiopia. In total, 35,000 posts and
comments were collected. To remove the noise from the data set, rigorous preprocessing
was carried out, which resulted in the removal of HTML, URLs, tags, emoticons, and
other language scripts.
Based on these criteria, the selected pages and the number of data collected from each
page are given in the table below:
No.  Page                   Posts and comments collected
4    Taye Dendea Aredo      6,200
5    BBC Afaan Oromo        4,560
6    Oromia Media Network   6,000
7    Jawar Mohammed         4,600
8    VOA Afaan Oromoo       4,500
Total number of data filtered: 37,244
Total number of unique data after removing redundancy: 35,000
The annotation task for the multi-label hate speech dataset involved assigning multiple
labels to each data instance from a set of five classes: free speech, religion, race, politics,
and offensive content. Four annotators, including the researcher, completed this task.
Each annotator was responsible for reviewing and annotating the texts, considering all
relevant classes that applied to each instance. The annotators were selected based on
their background knowledge and expertise in analyzing hate speech. By leveraging the
combined efforts of these four annotators, the dataset was comprehensively labeled with
appropriate class labels, allowing for a more nuanced understanding of hate speech
across multiple dimensions. The annotation task was performed following the provided
annotation guidelines, which served as a set of instructions and criteria determining how
each instance in the dataset should be labeled. To ensure consistency, each annotator
was provided with the following guidelines (rules). A post is marked as politics-, race-,
or religion-related speech as follows:
If at least one of the criteria (1-5) is met, the annotators classify the sentence (speech)
into the politics, religion, or race hate class.
If the above criteria are fulfilled, the instance is labeled with the corresponding (politics,
religion, or race) hate speech class; otherwise, it is labeled as neutral speech.
Offensive language class: if at least one of the following criteria (1-4) is met, the
annotators classify the sentence (speech) as offensive-class hate speech:
If the post or comment contains a common insult without falling under the other three
classes above.
If the post or comment contains an insult, whether or not it promotes a violent attack on
an individual or group.
If the post or comment contains an upsetting word but the particular subject of the
sentence is unknown (the sentence has a hidden subject/noun/pronoun).
Following the above guidelines, we assigned multiple labels to each data instance from
the set of five classes. Detailed statistics of the balanced Afaan Oromo dataset,
categorized into five classes (Free, Race, Politics, Religion, and Offensive Language),
are provided in Table 2.
start with text normalization. Text normalization includes:
Convert text to lowercase: lowercasing all Afan Oromo text data, although
commonly overlooked, is one of the easiest and most useful forms of text
preprocessing. It is applicable to most text mining and NLP problems and can
help in cases where the corpus is not very large, meaningfully improving the
consistency of the output. For example, when training a word embedding model
for similarity lookups, different variations in input capitalization (e.g.,
'Paartii' vs. 'paartii') can give different outputs or no output at all. This
probably happens because the dataset has mixed-case occurrences of the word
'Paartii' and there is insufficient evidence for the neural network to
effectively learn the weights for the less common version. This type of issue is
bound to occur when the dataset is fairly small, and lowercasing is a great way
to deal with such sparsity issues.
Remove numbers: numbers are removed if they are not relevant to the
analysis. Typically, regular expressions are used to remove numbers.
Remove punctuation: punctuation is also removed; it is essentially the set of
symbols [!"#$%&()*+,./:;<=>?@[\]^_{|}~].
Remove HTML tags: since our dataset is web-scraped, there is a chance that it
contains some HTML tags. Since these tags are not useful for our natural
language processing tasks, it is better to remove them.
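The four normalization steps above can be sketched as one cleaning function; the regex patterns and the example sentence are illustrative, not taken from the actual pipeline:

```python
import re

def clean_text(text):
    """Apply the normalization steps described above to one post/comment."""
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags
    text = text.lower()                                   # lowercase ('Paartii' -> 'paartii')
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    text = re.sub(r'[!"#$%&()*+,./:;<=>?@\[\]^_{|}~]',    # remove punctuation
                  " ", text)
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace
    return text

print(clean_text("<p>Paartii 123, gaarii!</p>"))  # -> paartii gaarii
```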
Tokenization
To get a computer to understand Afaan Oromo text, we need to break the
text down in a way that a machine can understand; that is where the concept
of tokenization in Natural Language Processing comes in. As tokens are the
building blocks of natural language, the most common way of processing raw
text happens at the token level. For example, Transformer-based models, the
state-of-the-art neural network architectures in Natural Language Processing
(NLP), process raw text at the token level, and the most popular neural
network architectures for NLP, such as RNN, GRU, and LSTM, do likewise.
Tokenization can be broadly classified into three types: word, character, and
subword (character n-gram) tokenization [27]. In our work, word-level
tokenization is used.
Word tokenization: this is the most commonly used tokenization algorithm.
It splits a given text into individual words based on a certain delimiter;
depending on the delimiter, different Afaan Oromo word-level tokens are
formed. Pre-trained word embeddings such as Word2Vec and GloVe use this
type of tokenization.
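A minimal whitespace word tokenizer of the kind described (the example sentence is illustrative):

```python
def word_tokenize(text, delimiter=None):
    """Word-level tokenization: split text on a delimiter (whitespace by default)."""
    return text.split(delimiter)

print(word_tokenize("jechoota Afaan Oromoo"))  # ['jechoota', 'Afaan', 'Oromoo']
```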
3.5. Splitting Dataset
For the experiment, the dataset is divided into two sets, i.e., Training Set and Testing
Set. In this research, we split our dataset into Training Set and Testing Set with a ratio
of 80:20, respectively. The training set is used to train and optimize models. The
testing set (unseen set) is used to evaluate models.
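The 80:20 split described above can be sketched with scikit-learn's train_test_split; the data here is a toy placeholder standing in for the vectorized posts and their labels:

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels (the real inputs are the vectorized posts).
X = list(range(100))
y = [i % 2 for i in range(100)]

# 80:20 split; random_state fixes the shuffle for reproducibility,
# stratify keeps the label distribution the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

print(len(X_train), len(X_test))  # 80 20
```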
(1/T) * Σ_{t=1}^{T} log p(w_t | w_c)                                  (1)
where w_t in Eq. 1 is the target word, and w_c represents the sequence of context
words. The Word2Vec model can be implemented in two ways: (1) pre-training and
using it as an input layer at the beginning of the model architecture, or (2) training it
with the model itself.
data processing tools, because Python is the language of choice for developers,
researchers, and data scientists.
Implementation tools
Tools Version Description
Anaconda    1.9.12   A desktop graphical user interface (GUI) included in the
Navigator            Anaconda distribution that allows us to manage conda
                     packages, environments, and channels and to launch
                     applications without command-line commands.
Jupyter     6.0.3    An open-source web application for creating and sharing
Notebook             documents that contain live code, equations, visualizations,
                     and narrative text. Uses include numerical simulation, data
                     visualization, statistical modeling, data cleaning, and
                     transformation.
Python      3.7.0    A general-purpose programming language suitable for
                     implementing deep learning algorithms.
Notepad++   7.8.6    A source code and text editor for Microsoft Windows that
                     supports tabbed editing. We used it for data preparation and
                     for managing the annotators' data annotation process.
PyTorch     1.4.0    An open-source machine learning library providing optimized
                     tensor computation on CPU and GPU, used for developing and
                     training neural network-based deep learning models. We used
                     it to train our RNN-based hate speech detection model.
NumPy       1.18.1   Array processing for numbers, strings, and objects. We used
                     it to handle our dataset features for training and testing
                     the models.
Matplotlib  3.1.3    Produces publication-quality figures in Python. We used it
                     for data visualization.
Deployment Environments
The tools used for implementation were installed on a personal computer (DESKTOP-
JPLMC78) equipped with an Intel(R) Core(TM) i3-4005U CPU @ 1.70 GHz, 6.00 GB
of RAM, and a 64-bit operating system on an x64-based processor.
Chapter 4: Results and Discussion
In this chapter, the results of each proposed model are presented. The data source,
data preprocessing, and data labeling are discussed first; after that, the experiments
and evaluation of each model are discussed.
This corpus contains Afan Oromo sentences with appropriate collections of hate
speech from different sources. The diagram provides information about the data: the
total number of sentences and words.
Normalization
Normalization is the process of transforming text into a standard format that can be
easily analyzed and compared. In natural language processing, normalization is often
used to correct misspelled words, remove slang words, or standardize the
representation of words with different spellings but the same meaning.
In the case of homophones like "baayee" and "baayyee", normalization can be used to
represent both words in a consistent and standardized way. This can improve the
accuracy of natural language processing models that rely on text data by reducing the
number of unique representations of the same word.
Additionally, normalization can help to improve the readability and understandability
of text by removing noise and irrelevant information.
Overall, normalization is an important step in preparing text data for analysis and can
help to improve the accuracy and effectiveness of natural language processing models.
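As a sketch, variant normalization can be implemented as a lookup table mapping spelling variants to one canonical form; the mapping below contains only the "baayee"/"baayyee" example from the text, and the choice of canonical spelling is an assumption:

```python
# Hypothetical variant-to-canonical mapping; the actual list used in this
# work is not reproduced here.
VARIANT_MAP = {"baayee": "baayyee"}

def normalize_tokens(tokens):
    """Replace each known spelling variant with its canonical form."""
    return [VARIANT_MAP.get(t, t) for t in tokens]

print(normalize_tokens(["baayee", "gaarii"]))  # ['baayyee', 'gaarii']
```

Collapsing variants this way reduces the number of unique representations of the same word before feature extraction.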
Stop word removal
The use of stopword removal is to improve the accuracy and efficiency of natural
language processing (NLP) tasks such as text classification, sentiment analysis, and
information retrieval.
Stopwords are words that are common in a language and do not carry much meaning,
such as "akkuma", "kan", "osoo", "yoom", "fi", "naaf", "hanga", etc. These words
appear frequently in text and can clutter the dataset without adding much value to the
analysis.
By removing stopwords from text data, we can reduce the size of the dataset and
improve the accuracy of NLP models. This is because stopwords can distort the
frequency of important words in the dataset, making it more difficult for the model to
identify meaningful patterns. Stopword removal can also improve the efficiency of
NLP tasks by reducing the processing time required to analyze text data. This is
because removing stopwords reduces the amount of text that needs to be processed,
making the analysis faster and more efficient.
Define a list of stopwords for the Afan Oromo language.
Tokenize the input text into individual words.
For each word in the list of words:
    If the word is a stopword, remove it from the list of words.
    Otherwise, keep it in the list of words.
Join the filtered words back into a single string and return it.
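A runnable version of the pseudocode above; the stopword list here is only the illustrative subset quoted earlier in this section, and the example sentence is hypothetical:

```python
# Illustrative subset of Afan Oromo stopwords from the examples above;
# the full list used in this work is longer.
STOPWORDS = {"akkuma", "kan", "osoo", "yoom", "fi", "naaf", "hanga"}

def remove_stopwords(text):
    words = text.split()                             # tokenize into words
    kept = [w for w in words if w not in STOPWORDS]  # drop stopwords
    return " ".join(kept)                            # rejoin into a string

print(remove_stopwords("inni fi isheen hanga galgalaa hojjetan"))
# -> inni isheen galgalaa hojjetan
```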
Overall, stopword removal is an important step in preprocessing text data for NLP
tasks, as it can improve the accuracy and efficiency of the analysis.
Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True
Negative + False Negative)                                                      (4.1)
Precision is the ratio of true positive cases over the total number of cases predicted
as positive. It measures how many of the predicted positive cases are actually positive
(Islam, Mercer, & Xiao, 2019). The formula for precision is:
Precision = True Positive / (True Positive + False Positive)
Recall is the ratio of true positive cases over the total number of actual positive cases.
It measures how many of the actual positive cases were identified as positive by the
model (Islam, Mercer, & Xiao, 2019). The formula for recall is:
Recall = True Positive / (True Positive + False Negative)
F-score is a measure that combines both precision and recall into a single metric: it is
the harmonic mean of precision and recall. The formula for the F-score is:
F-score = (2 × Precision × Recall) / (Precision + Recall)
The F-score is a useful metric when precision and recall are equally important and you
want to balance between them.
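The four metrics above can be computed directly from the confusion counts; the count values below are arbitrary illustrative numbers, not results from this work:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, p, r, f1 = metrics(tp=40, fp=10, tn=45, fn=5)
print(round(acc, 2), round(p, 2), round(r, 2), round(f1, 3))  # 0.85 0.8 0.89 0.842
```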
A confusion matrix tabulates true positives, false positives, true negatives, and false
negatives. The rows of the matrix represent the actual (ground truth) values, and the
columns represent the predicted values. The confusion matrix is used to calculate the
accuracy, precision, recall, and F-score. The following is an example of a confusion
matrix:
In the above Table 4-1 matrix, True Positive represents the number of cases where the
model predicted positive, and the actual value was positive. False Positive represents
the number of cases where the model predicted positive, but the actual value was
negative. False Negative represents the number of cases where the model predicted
negative, but the actual value was positive. True Negative represents the number of
cases where the model predicted negative, and the actual value was negative (Islam,
Mercer, & Xiao, 2019).
4.5.Experiment Results
After preprocessing, we convert the text data into numerical features using TF-IDF.
We split the dataset into training and testing sets using an 80/20 split ratio. We then
train an SVM multi-label classifier using OneVsRestClassifier (Pedregosa, 2011) and
evaluate its performance on the testing set using metrics such as accuracy, precision,
recall, and F1-score. We trained this support vector machine (SVM) model on our hate
speech dataset and achieved an accuracy of 78%, precision of 78%, recall of 77%, and
an F1-score of 80%.
As with the other machine learning models, we used the logistic regression classifier
from the scikit-learn library, with the following hyperparameters:
- Penalty: the type of regularization. We used L2 regularization, which adds a penalty
term to the loss function proportional to the squared magnitude of the weights.
- Solver: the optimization algorithm. We used the 'liblinear' solver, which is efficient
for small datasets and supports L1 and L2 regularization.
- Multi-class: the method for handling multi-class classification. We used the 'ovr'
(one-vs-rest) strategy, which trains a separate binary classifier for each class and
makes predictions based on the highest probability.
The logistic regression (LR) model on the same dataset achieved an accuracy of
77.7%, precision of 79.3%, recall of 76.6%, and F1-score of 81.0%.
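The hyperparameter choices above can be sketched as follows; the one-vs-rest strategy is realized here with scikit-learn's OneVsRestClassifier wrapper, and the data is a toy placeholder rather than the real TF-IDF matrix:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 2-D feature vectors with binary labels standing in for the real data.
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
y = [0, 1, 1, 0, 1, 0]

# L2 penalty with the 'liblinear' solver, wrapped one-vs-rest as described.
clf = OneVsRestClassifier(
    LogisticRegression(penalty="l2", solver="liblinear"))
clf.fit(X, y)

print(list(clf.predict(X)))
```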
After repeating the same steps as for the SVM, we trained a Random Forest classifier
on the training set using the scikit-learn library. Random Forest is a suitable algorithm
for multi-label classification tasks, as it can handle multiple labels for each sample
without the need for the MultiOutputClassifier wrapper. We trained a Random Forest
(RF) model on the dataset and achieved an accuracy of 82%, precision of 82%, recall
of 79.8%, and F1 score of 80%. The confusion matrix showed that the model had
lower precision and recall scores for the politics and race intolerance categories
compared to the SVM and LR models.
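As noted above, scikit-learn's RandomForestClassifier accepts a multi-label indicator matrix directly, so no MultiOutputClassifier wrapper is needed; the following is a toy sketch, not the actual experiment:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy features and a 2-D multi-label indicator target (two labels possible
# per sample); the real Y has one column per hate speech class.
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [2, 0], [0, 2]])
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [0, 1], [1, 0]])

rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X, Y)                  # fits one multi-output forest, no wrapper

print(rf.predict(X).shape)    # (6, 2)
```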
4.6.3. Experiment with Ensemble CNN
Parameter              Values
Convolutional layers   1
Pooling layer          GlobalMaxPooling1D
Activation functions   ReLU; softmax (at output layer)
Epochs                 20
Batch size             64
Optimizer              Adam
Training/test size     80% and 20%
Table 4- 2 Model Configuration of CNN
The input embedding dimension reflects the vocabulary size of our data, which is
53,473. The dropout rate is 0.5, meaning 50% of the neurons are randomly dropped
during training. We have one convolutional layer whose filter count, kernel size, and
activation are 128, 2, and ReLU, respectively, in our case.
The proposed CNN achieved an accuracy of 85%, with a precision of 86%, recall of
84%, and F1-score of 85%.
Figure 4- 3 Confusion Matrix of CNN model
Per-class results with word2vec features (partial table):
Class      Precision  Recall  F1-score
Offensive  0.95       0.95    0.95
Politics   0.96       0.97    0.96
We trained Word2Vec on about 60,000 words, which is likely to provide a rich and
diverse set of word embeddings capturing the semantic relationships between words in
our text corpus. These embeddings can then be used as input features for the CNN
and LR models, which learn to classify hate speech texts based on them.
Per-class results of the CNN+LR ensemble with word2vec (partial table):
Class      Precision  Recall  F1-score
Offensive  0.94       0.93    0.94
Race       0.96       0.93    0.94
The CNN model is particularly effective at learning local and global features in text
data, such as n-grams and sentence structures. The LR model, on the other hand, is a
linear model that creates decision boundaries separating different classes of text data.
By combining these models with Word2Vec embeddings, we create a more powerful
and accurate model that outperformed the individual models: the ensemble of the LR
and CNN models achieved an accuracy of 95%, with the other metrics listed in the
table above. The ensemble of CNN and LR showed very impressive performance
compared to the individual models.
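One simple way to combine two base models is soft voting: average their per-class probability vectors and take the argmax. This is an illustrative sketch with made-up probabilities; the exact combination rule used in this work is not specified above:

```python
import numpy as np

# Made-up per-class probabilities from two base models (e.g. CNN and LR)
# for two input texts over five classes.
cnn_probs = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],
                      [0.20, 0.50, 0.10, 0.10, 0.10]])
lr_probs = np.array([[0.60, 0.20, 0.10, 0.05, 0.05],
                     [0.10, 0.60, 0.10, 0.10, 0.10]])

# Soft voting: average the probabilities, then pick the most likely class.
ensemble_probs = (cnn_probs + lr_probs) / 2.0
pred = ensemble_probs.argmax(axis=1)
print(pred)  # [0 1]
```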
4.5.6. Experiment Ensemble CNN+RF
Per-class results of the CNN+RF ensemble with word2vec (partial table):
Class      Precision  Recall  F1-score
Offensive  0.95       0.95    0.95
Figure 4- 6 Confusion Matrix for Ensemble CNN+RF
The accuracy is the same as that of the CNN+LR model; however, there are some
differences in the other metrics. The ensemble CNN+RF achieved better results,
especially for the free, offensive, and race classes, than CNN+LR. In general, the
ensemble models showed improved performance compared to the individual models,
indicating that combining different machine learning models can lead to better results
in hate speech detection.
To develop a GRU (Gated Recurrent Unit) model for Afan Oromo hate speech, we
first specify the model configuration, hyperparameter tuning, and evaluation procedure
needed to achieve the best results. For the GRU, the following model configuration is
used. The input dimension of the embedding layer is 53,475, as mentioned in the table
below, and the embedding is pre-trained word2vec with about 60,000 word embeddings.
Parameter Values
Embedding dimension Input dimension = 53475, output dimension=300
Dropout rate 0.4
Number of GRU layers 3
Epochs 30
Batch size 64
Optimizer Adam
Training/Test size 80 % and 20 %
Table 4- 7 Model Configuration (parameter setting) of GRU
The embedding layer takes the vocabulary size, the weights from the pre-trained
word2vec embedding matrix, the output dimension, and the maximum length of input
sequences in our dataset. The GRU consists of 3 layers with a dropout rate of 0.4 to
prevent overfitting. The last dense layer consists of 5 units, corresponding to the
number of classes in our problem, with a softmax activation function to generate
probabilities for each class. The model is trained using the categorical_crossentropy
loss function, optimized with the Adam optimizer, and evaluated based on accuracy.
The proposed GRU achieved an accuracy of 85%, with a precision of 86%, recall of
84%, and F1-score of 85%.
Figure 4- 7 Confusion Matrix of GRU Model
The combination of GRU and SVM achieved an accuracy of 90%, indicating that it
correctly predicted 90% of the instances. In terms of recall, the model achieved a
value of 0.89, which means that it correctly identified 89% of the positive instances in
the dataset; the precision and F1-score are 91% and 90%, respectively. Recall is an
important metric, especially in scenarios where correctly identifying positive instances
is crucial: a high recall value indicates that the model is effective at capturing positive
instances, reducing the chances of false negatives.
Per-class results of the GRU+SVM ensemble with word2vec (partial table):
Class      Precision  Recall  F1-score  Accuracy
Free       0.88       0.90    0.89      0.90
Offensive  0.89       0.88    0.89
The precision of the model was measured to be 0.90, indicating that out of all the
instances predicted as positive, 90% were actually positive. Precision measures the
ability of the model to avoid false positives. A high precision value signifies that the
model is proficient in correctly classifying positive instances and minimizing false
positives.
Figure 4- 8 Confusion Matrix of Ensemble GRU+SVM
These results suggest that the GRU+SVM ensemble approach is effective in hate
speech detection in Afan Oromo language on social media. The combination of the
GRU model's ability to capture sequential dependencies and the SVM model's
strength in classification contributes to the overall performance of the ensemble.
In terms of recall, the ensemble model achieved a value of 0.90, meaning that it
correctly identified 90% of the positive instances in the dataset.
The precision of the ensemble model was measured to be 0.88, indicating that out of
all the instances predicted as positive, 88% were actually positive. Precision measures
the model's ability to avoid false positives. A high precision value indicates that the
model is proficient in correctly classifying positive instances and minimizing false
positives.
Per-class results of the GRU+LR ensemble with word2vec (partial table):
Class      Precision  Recall  F1-score
Offensive  0.89       0.88    0.88
Compared to GRU+SVM, the accuracy is the same; however, there are some
differences in the other metrics, such as precision, recall, and F1-score. Overall, these
results demonstrate that the GRU+LR ensemble approach is effective in detecting
hate speech in the Afan Oromo language on social media.
The model achieved an accuracy of 99%, indicating that it correctly predicted 99% of
the instances. The model achieved a recall of 0.98, meaning that it correctly identified
98% of the positive instances in the dataset. Recall is an important metric, especially
in scenarios where correctly identifying positive instances is crucial. A high recall
value indicates that the model is effective at capturing positive instances, reducing the
chances of false negatives. A precision of 0.98, indicating that out of all the instances
predicted as positive, 98% were actually positive. Precision measures the ability of the
model to avoid false positives.
Per-class results of the GRU+RF ensemble with word2vec (partial table):
Class  Precision  Recall  F1-score  Accuracy
Free   0.96       0.98    0.97      0.99
The ensemble of GRU+RF outperforms all the other models on all performance
metrics.
The detailed configuration of the BiLSTM is depicted in the table below. The first
layer is the embedding layer, whose input dimension of 53,475 indicates the vocabulary
size of our dataset. We have a pre-trained Word2Vec model that provides the word
embeddings: each word in the vocabulary is assigned a unique vector representation,
and the weights of these vectors are learned by predicting the context words given a
target word, or vice versa. The learned weights capture the semantic relationships
between words based on their co-occurrence patterns in the training data.
Parameters Values
Embedding dimension Input dimension=53475, output
dimension=300
Dropout rate 0.5 (recurrent dropout = 0.3)
Memory unit 200
Epochs 30
Batch size 64
Optimizer Adam
Training/Test size 80 % and 20 %
Table 4- 12 Model Configuration of BiLSTM
The output of the embedding layer is fed to a BiLSTM layer containing 200 units.
We used 50% dropout, which randomly deactivates 50% of the neurons; this technique
is used to avoid overfitting. We also applied 30% dropout to the recurrent
connections. For training, the batch size was set to 64 and the number of epochs
to 30. At the last layer, a softmax activation function is used with a dense layer
of 5 output units to predict the class of hate speech.
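As a rough sanity check of the configuration in Table 4-12, the layer sizes
implied by the table can be computed directly (the Keras-style parameter-count
formulas below are our assumption, not something stated in the text):

```python
vocab_size, embed_dim, units = 53475, 300, 200

# Embedding layer: one 300-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent
# weights, and a bias vector; a BiLSTM doubles this.
lstm_params = 4 * (embed_dim * units + units * units + units)
bilstm_params = 2 * lstm_params

print(embedding_params)  # 16042500
print(bilstm_params)     # 801600
```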
BiLSTM+word2vec   Offensive   0.86   0.86   0.84
                  Race        0.85   0.86   0.87
(table fragment)
The proposed BiLSTM achieved an accuracy of 85%, a precision of 85%, a recall of
86%, and an F1-score of 85%.
On the other hand, the individual SVM model achieved an accuracy of 78%, precision
of 78%, recall of 77%, and an F1-score of 80%. These metrics suggest that the SVM
model performed slightly lower than the BiLSTM model, but still achieved good
results.
BiLSTM+SVM+word2vec   Offensive   0.87   0.87   0.86   (table fragment)
When combining the predictions of the BiLSTM and SVM models in the ensemble,
we observed an improvement in performance. The ensemble model achieved an
accuracy of 89%, precision of 88%, recall of 87%, and an F1-score of 88%. These
results indicate that the ensemble model was able to leverage the strengths of both the
BiLSTM and SVM models to achieve better overall performance compared to the
individual models.
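The combination step can be sketched as simple soft voting: average the two
models' per-class probabilities and pick the highest. This is a generic
illustration with invented probabilities, not the exact ensembling code used in
the study:

```python
def soft_vote(p_deep, p_classic):
    """Average two models' class-probability vectors and pick the winner."""
    avg = [(a + b) / 2 for a, b in zip(p_deep, p_classic)]
    return max(range(len(avg)), key=lambda i: avg[i]), avg

# Hypothetical probabilities over the 5 classes for one comment.
p_bilstm = [0.10, 0.60, 0.10, 0.10, 0.10]
p_svm    = [0.05, 0.45, 0.30, 0.10, 0.10]
label, avg = soft_vote(p_bilstm, p_svm)
print(label)  # 1
```

Averaging probabilities (rather than hard labels) lets a confident model outvote an uncertain one on a per-comment basis.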
The individual BiLSTM model achieved an accuracy of 85%, precision of 85%, recall
of 86%, and an F1-score of 85%. These metrics indicate that the BiLSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
BiLSTM+LR+word2vec   Offensive   0.87   0.86   0.86   (table fragment)
When combining the predictions of the BiLSTM and LR models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 88%, precision of 88%, recall of 87%, and an F1-score of 87%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the
BiLSTM and LR models to achieve better overall performance compared to the
individual models.
Figure 4- 12 Confusion Matrix of Ensemble BiLSTM+LR
The individual BiLSTM model achieved an accuracy of 85%, precision of 85%, recall
of 86%, and an F1-score of 85%. These metrics indicate that the BiLSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
When combining the predictions of the BiLSTM and RF models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 99%, precision of 99%, recall of 97%, and an F1-score of 98%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the
BiLSTM and RF models to achieve better overall performance compared to the
individual models.
Comparing the ensemble model to its base models, it outperformed both the
individual BiLSTM model and the individual RF model on all metrics. This suggests
that the ensemble approach was effective in improving classification accuracy,
precision, recall and F1-score compared to the baselines.
Figure 4- 13 Confusion Matrix of ensemble BiLSTM+RF
One important aspect to consider is the interpretability of the models. The BiLSTM
model captures complex patterns and dependencies in sequential data but lacks
interpretability. On the other hand, the RF model provides more interpretable decision
boundaries but may not capture complex patterns as effectively. The ensemble
leverages the strengths of both models, combining their predictive power and
potential interpretability.
To develop the LSTM model, we first define the LSTM architecture, which consists
of an embedding layer, LSTM layer(s) and a final output layer.
Parameters Values
Embedding dimension Input dimension = 53475, output dimension=300
Dropout rate 0.5
Memory unit 200 for both LSTM layers
Epochs 30
Batch size 64
Optimizer Adam
Training/Test size 80 % and 20 %
Table 4- 17 Model Configuration for LSTM
We used a softmax activation function at the output layer for the multi-label
(multi-class) problem. The dense layer has five (5) units, which represents the
number of classes. In addition, as in the other deep learning models, dropout
regularization is used to avoid overfitting; in this case we used a dropout rate
of 0.5, so 50% of the neurons are dropped.
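The softmax output layer described above turns the five logits of the dense layer
into a probability distribution over the classes; a minimal sketch (the logit
values below are invented):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                       # shift for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Five output units, one per hate speech class.
probs = softmax([2.0, 1.0, 0.5, 0.1, -1.0])
print(probs.index(max(probs)))  # 0  (the predicted class)
```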
Figure 4- 14 Confusion Matrix of LSTM
The individual LSTM model achieved an accuracy of 80%, precision of 78%, recall
of 82%, and an F1-score of 80%. These metrics indicate that the LSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
The individual SVM model achieved an accuracy of 78%, precision of 78%, recall of
77%, and an F1-score of 80%. While the SVM model's performance is slightly lower
than the LSTM model, it still achieves reasonable results.
Algorithm            Hate speech class   Precision   Recall   F1_score   Accuracy
LSTM+SVM+word2vec    Free                0.89        0.90     0.89       0.90
                     Offensive           0.89        0.87     0.88
When combining the predictions of the LSTM and SVM models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 90%, precision of 91%, recall of 89%, and an F1-score of 90%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the LSTM
and SVM models to achieve better overall performance compared to the individual
models.
The individual LSTM model achieved an accuracy of 80%, precision of 78%, recall
of 82%, and an F1-score of 80%. These metrics indicate that the LSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
Figure 4- 16 Confusion Matrix of LSTM+LR
When combining the predictions of the LSTM and LR models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 91%, precision of 91%, recall of 89%, and an F1-score of 90%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the LSTM
and LR models to achieve better overall performance compared to the individual
models.
The individual LSTM model achieved an accuracy of 80%, precision of 78%, recall
of 82%, and an F1-score of 80%. These metrics indicate that the LSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
When combining the predictions of the LSTM and RF models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 99%, precision of 96%, recall of 99%, and an F1-score of 97%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the
LSTM and RF models to achieve better overall performance compared to the
individual models.
Figure 4- 17 Confusion Matrix of LSTM+RF
4.6. Summary
In the study on "Ensemble Deep Learning for Multi-label Afan Oromo Hate Speech
Detection on Social Media," the following models and their combinations were
explored for hate speech detection in Afan Oromo language on social media. Here is a
summary of the models, their results, and a comparison of their performance: SVM
(Support Vector Machine): The SVM model achieved an accuracy of 78%, precision
of 78%, recall of 77%, and an F1-score of 80%. While SVM is a traditional machine
learning algorithm, its performance in capturing complex patterns and dependencies
in text data may be limited. LR (Logistic Regression): The LR model achieved an
accuracy of 77.7%, a precision of 79.3%, a recall of 76.6%, and an F1-score of
81.0%. Similar to SVM,
logistic regression is a linear classifier that may not effectively capture complex
patterns in text data. RF (Random Forest): The RF model achieved an accuracy of
84%, precision of 85%, recall of 82%, and an F1-score of 84%. While Random Forest
can handle multi-label classification, it may not capture the sequential nature of text
data as effectively as deep learning models.
The CNN model achieved an accuracy of 85%, with a precision of 86%, a recall of
84% and an F1-score of 85%. CNNs can capture local dependencies and patterns in
text, making them suitable for hate speech detection; however, they may not
effectively capture long-term dependencies. The BiLSTM model achieved an accuracy
of 86%, a precision of 85%, a recall of 86%, and an F1-score of 85%. BiLSTM
models can capture both forward and backward dependencies, making them effective
in capturing long-term patterns in text data. The individual LSTM model achieved
an accuracy of 80%, a precision of 78%, a recall of 82%, and an F1-score of 80%.
The GRU achieved an accuracy of 85%, with a precision of 86%, a recall of 84%
and an F1-score of 85%.
Chapter 5: Conclusions and Recommendations
5.1. Conclusions
The pervasive presence of social media platforms has become a powerful tool for
communication, enabling individuals to share their thoughts, opinions, and
experiences with a global audience. While this unprecedented connectivity has
brought numerous benefits, it has also given rise to a concerning issue: the
proliferation of hate speech. Hate speech, characterized by discriminatory, offensive,
or harmful content targeting individuals or groups based on attributes such as race,
religion, ethnicity or politics, poses a significant threat to social harmony, online
discourse, and ultimately, society as a whole. Detecting hate speech in social media
content is an essential task, not only for maintaining a respectful and inclusive online
environment but also for legal and ethical reasons. Moreover, it is particularly
challenging in languages with limited resources and tools, such as Afan Oromo, one
of the widely spoken languages in East Africa. Afan Oromo has gained prominence in
the digital landscape due to its use in various social media platforms, making it crucial
to develop effective hate speech detection methods tailored to this language.
This study aims to address this pressing issue by proposing an ensemble deep learning
approach for multi-label Afan Oromo hate speech detection on social media.
Ensemble learning combines multiple machine learning models to enhance predictive
performance and robustness. In this study, various models were explored for hate
speech detection in the Afan Oromo language on social media. The models
considered included support vector machine, logistic regression, random forest,
convolutional neural network, long short-term memory, Gated Recurrent Unit and
Bidirectional LSTM.
In this study, we collected about 35,000 public comments from three different
social media networks, namely Facebook, Twitter and YouTube. The collected
dataset was annotated by four annotators, including the researcher. We applied
different pre-processing techniques such as stop-word removal, normalization and
cleaning.
In order to see the differences, we first developed and tested the individual
models. After evaluating the individual models, we combined the predictions of
the traditional machine learning models with those of the deep
learning models. The SVM model achieved an accuracy of 78%, precision of 80%, recall of
76%, and an F1-score of 78%. While SVM is a traditional machine learning
algorithm, its performance in capturing complex patterns and dependencies in text
data may be limited. The LR model achieved an accuracy of 82%, precision of 81%,
recall of 84%, and an F1-score of 82%. Similar to SVM, logistic regression is a linear
classifier that may not effectively capture complex patterns in text data. The RF
model achieved an accuracy of 84%, precision of 85%, recall of 82%, and an F1-score
of 84%. While Random Forest can handle multi-label classification, it may not
capture the sequential nature of text data as effectively as deep learning models.
The CNN model achieved an accuracy of 85%, with a precision of 86%, a recall of
84% and an F1-score of 85%. CNNs can capture local dependencies and patterns in
text, making them suitable for hate speech detection; however, they may not
effectively capture long-term dependencies. BiLSTM (Bidirectional LSTM): The
BiLSTM model achieved an accuracy of 85%, a precision of 85%, a recall of 86%,
and an F1-score of 85%. BiLSTM models can capture both forward and backward
dependencies, making them effective in capturing long-term patterns in text data.
The individual LSTM model achieved an accuracy of 80%, a precision of 78%, a
recall of 82%, and an F1-score of 80%. The GRU achieved an accuracy of 85%, with
a precision of 86%, a recall of 84% and an F1-score of 85%.
All of the ensemble models achieved even better results compared to the individual
models. The ensemble combinations include CNN+SVM (95%), CNN+LR,
CNN+RF, LSTM+SVM, LSTM+LR, LSTM+RF, BiLSTM+SVM, BiLSTM+LR,
BiLSTM+RF, GRU+SVM, GRU+LR and GRU+RF. The ensemble models leverage
the strengths of different models and achieve higher accuracy, precision, recall, and
F1-scores. The accuracy of the ensemble of CNN with each of the three machine
learning models is the same, 95%, apart from differences in the precision, recall
and F1-score of specific classes. The ensembles GRU+SVM, GRU+RF and GRU+LR
achieved 90%, 99% and 90% respectively; the ensembles of GRU with SVM and with LR
have the same accuracy, but there are small differences in their precision,
recall and F1-scores. The accuracies of the ensembles BiLSTM+SVM, BiLSTM+LR and
BiLSTM+RF are 89%, 88% and 99% respectively. The last ensemble models combine
LSTM with machine learning: LSTM+SVM achieved an accuracy of 90%, LSTM+LR 91%
and LSTM+RF 99%. From these results we conclude that the ensembles of BiLSTM,
GRU and LSTM with Random Forest (RF) outperformed all the other ensemble models.
In general, comparing the individual models, the ensemble models, and the traditional
machine learning models, the ensemble models consistently outperform the individual
models and traditional machine learning algorithms. The ensemble models
successfully combine the strengths of different models and improve overall
performance in hate speech detection on social media.
5.2. Recommendations and Future Directions
The primary purpose of this study was to design and develop an automated system
for ensemble deep learning-based multi-label Afan Oromo hate speech detection on
social media. Developing a full-fledged system of this kind requires coordinated
teamwork between linguistic and computer science experts. Even though the results
of this study are promising, our research still has several limitations. Future
work and recommendations for further improving hate speech detection on social
media using ensemble deep learning models for the Afan Oromo language include:
Online Learning and Real-Time Detection: Investigating methods to adapt the
ensemble models for online learning, where the models can be continuously
updated with new data, can enable real-time hate speech detection on social
media platforms.
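One way such continuous updating could look is sketched below: a single logistic
unit updated one example at a time with stochastic gradient descent. This is a
toy illustration in plain Python with made-up features and labels, not the
study's code:

```python
import math

def sgd_step(w, x, y, lr=0.5):
    """Update weights w in place from a single (features, label) example."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))       # predicted probability of "hate"
    for i, xi in enumerate(x):
        w[i] += lr * (y - p) * xi        # gradient step on the log-loss
    return w

# Toy comment stream: [bias, feature]; a high feature value means hateful (1).
stream = [([1.0, 0.0], 0), ([1.0, 5.0], 1)] * 50
w = [0.0, 0.0]
for x, y in stream:                      # the model keeps learning as data arrives
    sgd_step(w, x, y)

predict = lambda x: sum(wi * xi for wi, xi in zip(w, x)) > 0
print(predict([1.0, 5.0]), predict([1.0, 0.0]))  # True False
```

Because each update touches only one example, the same loop can keep running on live comments without retraining from scratch.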
By addressing these future work areas and recommendations, the performance and
effectiveness of ensemble deep learning models for Afan Oromo hate speech
detection can be further enhanced, leading to more accurate and robust detection
systems on social media platforms.