COLLEGE OF INFORMATICS
September, 2023
Keywords: Afaan Oromo, hate speech, multi-label, ensemble, machine learning
Table of Contents
Abstract..........................................................................................................................1
Chapter 1: Introduction..................................................................................................7
1.2. Motivation.......................................................................................................8
2.1. Introduction.......................................................................................................12
2.8.2. Bagging Ensemble......................................................................................32
3.4.5. Tokenization...............................................................................................46
3.4.6. Lowercase...................................................................................................46
3.4.7. Stemming....................................................................................................46
3.8.1. Inter Annotator Agreement (IAA)..............................................................48
3.8.4. Visualization...............................................................................................49
4.5.15. Experiment with LSTM...........................................................................76
4.6. Summary........................................................................................................82
5.1. Conclusions...................................................................................................84
List of Tables
Table of Figures
Chapter 1: Introduction
In everyday life, and especially on social media, the spread of hate speech is often accompanied by offensive language[3]. Offensive language is an utterance containing offensive words or phrases conveyed to the interlocutor (individuals or groups), whether verbally or in writing[2]. Hate speech that contains offensive words or phrases often accelerates social conflict because such words trigger emotions. Although offensive language is sometimes used only in jest (not to offend anyone), its use on social media can still lead to conflict through misunderstandings among citizens.
Hate speech and offensive language on social media must be detected to avoid conflicts between citizens. Social media platforms including Facebook, Instagram, and Twitter are striving to deploy artificial intelligence (AI) applications on their sites to block hate speech automatically and create safe environments for their users. Many academics have been studying the detection of hate speech in recent years. The majority of current studies
on the detection of hate speech focus on high-resource languages like English. Today, however, hate speech and insults directed at specific persons or groups are frequently expressed on social media in many languages. As a result, there is a need for systems that detect hate speech in languages with limited resources, such as Afaan Oromo. The authors of [4] and [5] proposed machine learning techniques, and those of [6] and [7] suggested deep learning methods, for detecting hate speech in Amharic social media texts. Some scholars have also researched hate speech detection in Afaan Oromo. For instance, the author of [8] conducted a comparison of Afaan Oromo hate speech detection using machine learning and deep learning, and the work of [9] introduces Afaan Oromo hate speech detection and classification on social media using several machine learning and deep learning models. There are also other existing works on hate speech detection using machine learning[10],[11] and deep learning[12],[13],[14]. Beyond single deep learning models, many researchers today follow an ensemble of deep learning models to detect hate speech[15],[16],[3]. Even though the ensemble approach performs best for hate speech detection, it has not yet been applied to Afaan Oromo. To fill this gap, we follow an ensemble deep learning approach in this study.
In this research, we investigate the identification of offensive language and hate speech on Afaan Oromo social media. Facebook, Twitter, and YouTube are the social media networks in Ethiopia most frequently used to promote hate speech and offensive language; hence we choose them as our data sources. This is a multi-label text classification problem in which texts may contain offensive language, free speech, political hate speech, or religious hate speech.
To conduct our research on offensive language and hate speech detection, a corpus of comments and posts was first retrieved from Facebook, Twitter, and YouTube. After that, features are extracted using word embedding techniques such as Word2Vec[17]. Machine learning models, such as support vector machine (SVM), logistic regression (LR), and random forest (RF), and deep learning models, including convolutional neural networks (CNN), long short-term memory networks (LSTM)[18], gated recurrent units (GRU), and bidirectional long short-term memory networks (Bi-LSTM)[19], are employed for hate speech detection. We then implement an
ensemble of deep learning models based on the individual classifiers. We hypothesize
that the combination of multiple classifiers will lead to more accurate performance
results.
1.2. Motivation
As social media users, we have observed that many people in Ethiopia now post hate speech on a regular basis, especially against particular ethnic groups. The issue has even challenged the government, which has attempted multiple times to control its effects by interrupting internet connectivity[20]. Despite their efforts to build
AI for hate speech detection, social media companies like Facebook and Twitter continue
to face a difficult problem: social media hate speech. A recent declaration for the
prohibition of hate speech from the Ethiopian government requires social media service
providers to delete hate speech content from their sites. This is due to the fact that any
hate speech on social media might result in significant conflict in actual society. The
motivation behind this study is to address the gap in hate speech detection for the Afaan
Oromo language. We want to increase the efficiency of hate speech detection models and
help create a safer online environment by creating an ensemble deep learning technique.
We believe that by leveraging the power of ensemble models, we can enhance the
accuracy and robustness of hate speech detection, enabling proactive interventions and
promoting inclusivity, tolerance, and respectful dialogue.
Despite the progress made in hate speech detection using deep learning techniques, the
performance of individual models is often limited by factors such as bias in training data,
noise in the input, and overfitting. To address these issues, researchers have explored the
use of ensemble deep learning techniques, which combine multiple models to improve
the overall accuracy and reliability of hate speech detection. The efficiency of ensemble deep learning approaches for the detection and categorization of Afaan Oromo hate speech, however, is still not well studied. We thus propose an ensemble deep learning-based hate speech detection system for Afaan Oromo to fill this gap.
2. What is the optimal architecture and configuration of an ensemble deep learning model
for Afaan Oromo hate speech detection?
3. How does the proposed ensemble deep learning approach compare to individual deep
learning models in terms of accuracy, robustness, and generalization capability for Afaan
Oromo hate speech detection?
1.6. Objectives
1.6.2. Specific Objectives
To achieve the aforementioned general objective, the following specific tasks have been performed:
To prepare a hate speech dataset from Afaan Oromo text on social media pages, to be used for training and testing purposes.
To implement annotation rules for labeling posts and comments.
To develop an ensemble deep learning-based hate speech detection model that
combines multiple models to improve the accuracy and robustness of hate speech
detection.
To compare the performance of the ensemble model with individual deep learning
models commonly used for hate speech detection.
To investigate the contribution of individual models in the ensemble and analyze
the impact of different ensemble techniques on hate speech detection
performance.
In this work, some machine learning models and deep learning model variants such as
CNN, Bi-LSTM, LSTM, and GRU are trained, and their output is combined to enhance
the performance of the models. The dataset is divided into five categories: free speech,
hate speech motivated by religion, hate speech motivated by politics, hate speech
motivated by race, and offensive language.
1.7.2. Limitation of the study
Due to the complexity of the data annotation task and our limited number of annotators, we were unable to create a large dataset. Another major issue concerned computational resources: having only a low-end GPU prohibited us from carrying out GPU-intensive processes such as automatic hyperparameter tuning. A lack of time and data also prevented us from creating models for detecting hate speech in non-textual data, such as audio and video.
Chapter 2: Literature Review and Related Work
To further understand the topic and explore the research challenge, this chapter reviews pertinent literature. It includes a definition of hate speech, applications of hate speech detection systems, and current methods for identifying hate speech. Definitions
of hate speech and social media are provided first. The chapter then reviews hate speech
on social media, provides an introduction to the Afaan Oromo, discusses hate speech
detection methods, and reviews relevant literature on social media hate speech detection.
Definition 2: Facebook defines hate speech as "objectionable content that directly attacks people based on what we call protected characteristics, such as race, ethnicity, national origin, religious affiliation, sexual orientation, caste, sex, gender, and gender identity. Additionally, we offer various safeguards based on immigrant status." Facebook characterizes an attack as "violent or dehumanizing speech, statements of inferiority, or calls for exclusion or segregation."
Definition 3: According to the European Court of Human Rights[23], hate speech shall be deemed to include "all expressions which spread, incite, promote, or otherwise justify racial hatred, xenophobia, anti-Semitism, or other forms of hatred based on intolerance, including intolerance expressed through aggressive nationalism and ethnocentrism, discrimination against minorities, migrants, and people of immigrant origin".
sharing of their thoughts and opinions. However, there are also a number of frightening
repercussions, such as online harassment, trolling, cyberbullying, and hate speech.
In Ethiopia, hate speech on social media has become a major problem as the number of social media users in the country has grown. The lack of legal provisions that directly define or tackle hate speech makes the problem more difficult. There is, however, a law used indirectly for hate-related issues: the anti-terrorism law, which forbids "the use of any telecommunication network or apparatus to broadcast any terrorizing information" or "obscene message", including the use of any social media or other communication platform to disseminate terrorizing messages. Violations carry a prison sentence of up to eight years. However, the law has also been used to limit messages or speeches that criticize government policy and officials. It has drawn criticism from national and international organizations and the academic community because its use contradicts the freedom of speech guaranteed by human rights law. Lawmakers, government officials, and politicians in Ethiopia have now become aware of hate speech on social media, and on March 23, 2020, lawmakers enacted a new hate speech and fake news law, the proclamation cited as "Hate Speech and Disinformation Suppression and Prevention Proclamation No. 1185/2020"[24].
NLP is a collection of algorithms and methods used to extract meaning and grammatical structure from input natural language in order to perform useful tasks such as natural language generation, which builds output according to the rules of the target language,
which is spoken by people from a specific region, for a specific task [6]. Due to its ability to foster engagement and productivity, NLP is valuable in fields such as database interfaces, duplication detection, computer-supported teaching, and tutoring systems [25]. NLP techniques are developed so that commands given in natural language can be understood by the computer, which can then act on them. Natural language processing can be divided into two parts, namely written and spoken language. The 'levels of language' technique is the most illustrative way to explain what happens within a natural language processing system [27]. People use these levels to decode spoken or written language and determine its meaning, because language processing primarily relies on formal models or representations of knowledge at the various levels. By utilizing this linguistic knowledge, language processing applications distinguish themselves from ordinary data processing systems. Natural language processing analysis has the following levels: phonology, morphology, lexical, syntactic, semantic, discourse, and pragmatic. Afaan Oromo is one of the natural languages on which various NLP research can be conducted.
differences and inconsistencies among scholars and researchers regarding the number and categorization of the dialects[32]. These are Western (Wellega, Iluababor, Kaffa, and parts of Gojjam), Eastern (Harar, eastern Showa, and parts of Arsi and Bale), Central (central Showa, western Showa, and possibly Wollo), and Southern (parts of Arsi, Sidamo, and Borena).
Table 1: The Qubee international phonetic writing[29]
The way adjectives are formed in Afaan Oromo and English is another distinction. In
Afaan Oromo, adjectives are typically used after the noun or pronoun they modify. They
are also commonly used in close proximity to the noun. On the other hand, in English,
adjectives are frequently placed before nouns. For instance, in the phrase "ilma gaarii" (nice boy), the adjective "gaarii" follows the noun "ilma"[34].
Apart from apostrophes, the punctuation marks utilized in Afaan Oromo and English
serve the same purposes and are similar. In Afaan Oromo, the apostrophe mark (’) is
employed in writing to represent a specific sound called "hudhaa," whereas, in English, it
is used to indicate possession. The apostrophe mark (’) in Afaan Oromo plays a
significant role in both reading and writing systems. Punctuation marks are employed in
the text to enhance clarity of meaning and facilitate reading. Afaan Oromo follows the
punctuation patterns observed in English and other languages that adopt the Latin writing
system.
understand the data and help users in decision-making in their day-to-day activities.
Machine learning is commonly divided into supervised and unsupervised learning, along with reinforcement and semi-supervised learning.
Supervised learning: This kind of machine learning (ML) learns from labeled data. It makes use of a labeled dataset with matched pairs of observed inputs (Xs) and the corresponding outputs (Ys). A machine learning algorithm is applied to the dataset to infer patterns between inputs and outputs.
Unsupervised learning: In the unsupervised setting, rather than using labeled data, we let the model run on its own to uncover structure. It is employed to train a model on a dataset with only inputs, analyzing the data to give it meaning and to organize or categorize it. By imposing some structure on the data, it leverages unlabeled and uncategorized data to provide recommendations to users. It draws conclusions from input data without output data; because it lacks labeled outputs, its objective is to infer the natural organization that exists within a collection of data points.
The concept of learning from positive and/or negative feedback underlies the category of reinforcement learning. It seeks to train a model to identify the most advantageous series of choices (policy) for resolving a certain problem: successful decisions are rewarded, while unsuccessful ones are penalized. In addition to these three categories, many well-known algorithms use semi-supervised learning, which, as its name implies, combines supervised and unsupervised learning: an ML model is trained using both labeled and unlabeled examples. Bag-of-words, word, and character n-gram features are the most successful surface features in machine learning approaches for classifying hate speech. As for classifiers, SVM, random forests, decision trees, logistic regression, and Naive Bayes are the most often employed algorithms. Below is a discussion of the most popular machine learning techniques for detecting hate speech.
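As a small illustration of the surface features mentioned above, character n-grams can be extracted from a token with a short helper. This is a minimal sketch; the function name and the example word are our own:

```python
def char_ngrams(text, n=3):
    """Return all contiguous character n-grams of a string.

    Character n-grams are robust surface features for morphologically
    rich languages, since they capture sub-word patterns.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Example with an Afaan Oromo word ("jibba", hate):
print(char_ngrams("jibba", 3))  # ['jib', 'ibb', 'bba']
```

A bag of such n-grams (or of whole words) can then be fed to any of the classifiers listed above.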
for classification and applies Bayes' theorem to the computation. The Naive Bayes algorithm is an effective method for classification problems and is well suited to real-time prediction, multi-class prediction, recommendation systems, text classification, and sentiment analysis use cases[35].
one in which each node represents a test on an attribute, each branch indicates the test's result, and each leaf node serves as a class label. In this method, the class or value of the target variable is predicted from the training data using simple rules. In more detail, classification of a record begins at the decision tree's root, comparing the record's attributes to each node's attribute along the branches before a final class label is predicted at a leaf node.
Figure 1: An illustration of the position of deep learning (DL) compared with machine learning (ML) and artificial intelligence (AI)[36]
Deep learning also refers to data-driven learning techniques that use multi-layer neural
networks for computing and processing. In the deep learning approach, the word "Deep"
alludes to the idea of numerous levels or stages of data processing before a data-driven
model is created. Deep learning is a machine learning technique that, in general, enables
computers to learn and perform tasks that people would naturally be able to perform.
They can be taught in supervised, unsupervised, or semi-supervised methods, just like
machine learning algorithms.
Deep learning techniques utilize deep artificial neural networks to classify hate speech by learning abstract feature representations from incoming data through their numerous stacked layers[37]. The input might be either the raw text data itself or any of the feature encoding formats used in traditional approaches. The main distinction is that in such a model, the input features need not be employed directly for categorization. Instead, new abstract feature representations that are more effective for learning may be derived from the input using the multi-layer structure. Because of the careful design of the network structure, deep learning-based approaches shift attention from manually designing features to automatically extracting valuable features from a basic input feature representation. Indeed, there is a significant movement in the
literature towards the use of deep learning-based techniques, and studies have
demonstrated that they outperform conventional techniques on this task. Nowadays, deep learning models excel at text analytics, and at hate speech detection in particular, because of their reliance on deep neural network classifiers. Such a model learns to recognize patterns in the given text and attempts to replicate them across layers of neurons. The effectiveness of deep learning models depends on the chosen hyperparameters and neural network algorithm, as well as on the feature representation methods.
Figure 2: Architecture of a deep learning neural network[7]
Due to their great accuracy, deep learning algorithms have recently received a lot of
attention in the text classification problem. For the classification of texts, the following
deep learning methods are employed:
Given the input x_t and the previous hidden state h_{t-1}, the new hidden state and the output at time step t are computed as:

h_t = σ_h(W_h x_t + U_h h_{t-1} + b_h)
y_t = σ_y(W_y h_t + b_y)

where x_t is the input vector at time step t, h_t is the hidden layer vector, and y_t is the output vector at time step t. W, U, and b are parameter matrices and vectors, and σ_h, σ_y are activation functions.
In an RNN, the hidden layers are recurrent layers, in which every neuron in the hidden layer is connected. The hidden layer takes input from both the input layer x_t and the hidden state from the previous step, h_{t-1}. Recurrent neural networks are intended for modeling sequences and are capable of remembering prior information, in contrast to feed-forward neural networks, which receive fixed-size vectors as input and produce fixed-size vector outputs.
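The recurrence can be written out directly in plain Python. This is a toy illustration of the equations with tanh as σ_h and an identity output activation, not a trainable implementation; all names and numbers are our own:

```python
import math

def rnn_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    """One recurrent step: h_t = tanh(W_h x_t + U_h h_{t-1} + b_h),
    y_t = W_y h_t + b_y (identity output activation for simplicity)."""
    n_h = len(b_h)
    h_t = [math.tanh(sum(W_h[i][j] * x for j, x in enumerate(x_t)) +
                     sum(U_h[i][k] * h for k, h in enumerate(h_prev)) +
                     b_h[i])
           for i in range(n_h)]
    y_t = [sum(W_y[i][k] * h for k, h in enumerate(h_t)) + b_y[i]
           for i in range(len(b_y))]
    return h_t, y_t

# A 1-dimensional example: with all-zero weights the state stays at zero.
h, y = rnn_step([1.0], [0.5], W_h=[[0.0]], U_h=[[0.0]], b_h=[0.0],
                W_y=[[1.0]], b_y=[0.0])
```

Feeding a sequence means calling `rnn_step` once per token, passing the returned hidden state into the next call.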
However, short-term memory is a challenge for recurrent neural networks (RNNs). They
will struggle to transfer the information from earlier time steps to later ones if the input
sequence is sufficiently lengthy. This issue was resolved by LSTM and GRU.
methods for back-propagation in the bidirectional LSTM, from the front and from the back, respectively. Because of this procedure, the Bi-LSTM is an effective tool for analyzing textual data[38]. The Bi-LSTM architecture is depicted as follows:
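The two-directional pass can be illustrated with a generic recurrence standing in for the LSTM cell. This is a conceptual sketch only; the toy step function below is our own and is not an actual LSTM:

```python
def bidirectional_pass(seq, step, h0):
    """Run a recurrence left-to-right and right-to-left and pair up the
    hidden states per time step, as a Bi-LSTM concatenates them."""
    forward, h = [], h0
    for x in seq:
        h = step(x, h)
        forward.append(h)
    backward, h = [], h0
    for x in reversed(seq):
        h = step(x, h)
        backward.append(h)
    backward.reverse()
    return list(zip(forward, backward))

# Toy step: accumulate the inputs (a stand-in for an LSTM cell).
states = bidirectional_pass([1, 2, 3], step=lambda x, h: h + x, h0=0)
# The first time step sees little left context but the full right context.
```

In a real Bi-LSTM, each paired state gives the classifier information about both the words before and the words after the current position.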
2.7.6. Convolutional Neural Network (CNN)
A popular discriminative deep learning architecture, the convolutional neural network (CNN) learns directly from the input without requiring manual feature extraction. CNNs thus improve on the construction of conventional regularized MLP networks that resemble ANNs. Each layer of a CNN reduces model complexity while considering the optimal parameters for a meaningful output. Additionally, CNNs employ "dropout" to address the over-fitting that may arise in a conventional network.
Since CNNs are specifically designed to handle a variety of 2D shapes, they are commonly used in visual recognition, medical image analysis, image segmentation, natural language processing, and many other applications. Because a CNN can automatically detect important components of the input without human involvement, it is more effective than a standard network. Depending on their learning capacities, the many CNN variants, such as the visual geometry group network (VGG), AlexNet, Xception, Inception, and ResNet, may be applied in a variety of fields[36]. Using word2vec pre-trained vectors as the primary vector representation of words and character n-grams, the CNN, a standard multilayer neural network, has proven a promising solution for hate speech identification in social media datasets. CNNs employ pooling to reduce computational complexity, shrinking the output passed from one layer to the next while preserving its key elements; various pooling strategies can help reduce the outputs. CNNs have been successfully utilized for hate speech identification despite being designed for image processing [6]. In general, the CNN is a crucial technique for classifying sequential, text, and string data.
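The convolution-plus-pooling idea for text can be sketched with a single one-dimensional filter sliding over a (here scalar) embedding sequence. The numbers and function names are illustrative only:

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution: slide the filter over the sequence of
    (scalar) token embeddings and take dot products."""
    k = len(kernel)
    return [sum(kernel[j] * seq[i + j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def relu(xs):
    return [max(0.0, x) for x in xs]

def global_max_pool(xs):
    """Keep only the strongest filter response, as CNN text models do."""
    return max(xs)

# A difference filter that fires where token values rise sharply:
feature = global_max_pool(relu(conv1d([1.0, 2.0, 3.0, 0.0, 1.0],
                                      [1.0, -1.0])))  # 3.0
```

In a real model, hundreds of such filters of different widths run over dense word vectors, and their pooled responses form the feature vector passed to the classifier layer.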
The process of obtaining useful and distinctive qualities, known as features, from input data (in this case, text) is referred to as feature extraction. These features are gathered to improve how well deep learning systems perform on learning and classification tasks. Following extraction, a subset of the features often holds the most useful information[6].
Term frequency (TF): Given a document d, the term frequency is the number of times a given term t appears in that document. Intuitively, the more often a term occurs in the text, the more relevant it becomes.
Inverse document frequency (IDF): The primary function of the IDF is to evaluate a word's relevance, the main goal being to find the records that are actually pertinent to a query. Since TF accords all words equal importance, term frequencies alone cannot reliably gauge a term's weight in a document.
IDF(t) = log_e(Total number of documents / Number of documents containing term t)
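The TF and IDF definitions translate directly into code. Below is a minimal sketch over tokenized documents; the example tokens are our own:

```python
import math

def tf(term, doc):
    """Raw term frequency: how often the term appears in the document."""
    return doc.count(term)

def idf(term, documents):
    """IDF(t) = log_e(N / number of documents containing t)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

def tfidf(term, doc, documents):
    return tf(term, doc) * idf(term, documents)

docs = [["haasaa", "jibbaa"], ["haasaa"], ["nagaa"], ["nagaa", "haasaa"]]
score = tfidf("jibbaa", docs[0], docs)  # 1 * ln(4/1)
```

A term that occurs in few documents ("jibbaa" here, in one of four) gets a high IDF, while a term spread across most documents gets a weight near zero.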
Figure 4: An illustration of the continuous bag-of-words (CBOW) and Skip-gram word embedding models
GloVe, or Global Vectors for Word Representation, is a model that uses word co-occurrences to create a geometric encoding. With GloVe it is feasible to obtain word embeddings from the dataset that capture the semantic connections between words[6]. Various levels of semantic granularity are captured by the various dimensions (25, 50, 100, 200, and 300). The resulting features, referred to as GloVe features, are obtained by summing the GloVe vectors for each distinct phrase in a text, per dimension.
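The Skip-gram idea of Figure 4, predicting context words from a center word, starts from (center, context) training pairs, which can be generated as follows. This is a sketch of the pair-generation step only, not of the embedding training itself; the example tokens are our own:

```python
def skipgram_pairs(tokens, window=2):
    """List every (center, context) pair within the given window;
    these are the training examples the Skip-gram model learns from."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["ani", "nagaa", "jaalladha"], window=1)
# Each word is paired with its immediate neighbours.
```

CBOW simply reverses the direction of prediction, using the same windows: the context words jointly predict the center word.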
Ensembles of deep learning methods have emerged as a powerful approach for hate
speech detection, leveraging the strengths of multiple models to improve performance
and robustness. Hate speech detection is a challenging task that aims to automatically
identify and classify text or speech that contains offensive, discriminatory, or harmful
content. Deep learning techniques, such as neural networks, have shown promising
results in this area due to their ability to capture complex patterns and dependencies in
textual data.
In the context of hate speech detection, ensembles of deep learning methods can be
constructed using various architectures such as recurrent neural networks (RNNs),
convolutional neural networks (CNNs), and transformers. Each individual model within
the ensemble may have different architectural variations, hyperparameters, or pre-trained embeddings, ensuring a diverse set of classifiers.
Ensemble methods offer several advantages for hate speech detection. Firstly, they can
enhance the generalization ability of the system by reducing overfitting and capturing
different aspects of hate speech. Each model within the ensemble may focus on different
linguistic features or contextual cues, increasing the overall coverage and robustness of
the system. Secondly, ensembles can help mitigate the bias present in individual models.
Hate speech detection is a complex task that can be influenced by various factors,
including cultural, social, and linguistic biases. By combining the predictions of multiple
models, ensembles can reduce the impact of biased decisions made by individual
classifiers and provide a more balanced and fair classification outcome.
Furthermore, ensembles can improve the overall performance metrics of hate speech
detection systems. By leveraging the strengths of multiple models, ensembles can achieve
higher accuracy, precision, recall, and F1-score compared to individual classifiers. This is
particularly important in real-world applications where the consequences of
misclassifying hate speech can be severe. In this section, we give an overview of some of the most popular ensemble algorithms.
2.9.1. Averaging Ensemble
The simplest method for combining the predictions of multiple models is the averaging method[43]. It is a widely used ensemble technique in which each model is trained separately and the averaging step linearly integrates the models' predictions, averaging them to produce the final prediction. This technique is simple to apply and needs no extra training over large numbers of individual predictions. Usually, voting is the standard way to average the predictions of the baseline classifiers: the final prediction is determined by a majority vote on the predictions of the classifiers, which is referred to as hard voting.
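Hard voting on labels and soft voting on averaged class probabilities can each be sketched in a few lines. The label names and probability values below are hypothetical, purely for illustration:

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over the classifiers' predicted labels."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(prob_rows):
    """Average the per-class probabilities of each classifier,
    then pick the class with the highest mean (soft voting)."""
    n_models, n_classes = len(prob_rows), len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / n_models
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

winner = hard_vote(["hate", "free", "hate"])             # "hate"
cls = soft_vote([[0.6, 0.4], [0.2, 0.8], [0.45, 0.55]])  # class 1
```

Soft voting can overturn a hard-vote majority when the minority classifiers are much more confident, which is why it often performs slightly better in practice.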
training set. The predictions generated by each individual model are then aggregated to
form the Level-1 training set. It is important to note that, to prevent overfitting the meta-
learner, the data samples used for training the baseline classifiers must be excluded when
training the meta-learner. Therefore, the dataset needs to be split into two distinct parts.
The first part is used to train the base-level classifiers, while the second part is used to
construct the meta-dataset.
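The construction of the Level-1 (meta) dataset described above can be sketched with stand-in base models, simple rule functions taking the place of trained classifiers. All names and the toy data are illustrative only:

```python
def build_meta_dataset(base_models, X_holdout, y_holdout):
    """Level-1 data for stacking: each row holds the base models'
    predictions for one held-out sample; the labels are unchanged.
    The held-out split must not have been used to train the bases."""
    X_level1 = [[model(x) for model in base_models] for x in X_holdout]
    return X_level1, list(y_holdout)

# Stand-ins for trained base classifiers (0 = not hate, 1 = hate):
bases = [lambda x: int(x > 3), lambda x: int(x % 2 == 0)]
X_meta, y_meta = build_meta_dataset(bases, X_holdout=[2, 5], y_holdout=[0, 1])
# X_meta == [[0, 1], [1, 0]]; the meta-learner trains on these rows.
```

The meta-learner then fits on `X_meta` and `y_meta`, learning how much to trust each base model's prediction.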
Some research has used ML and DL models to detect hate speech. For example, the study in [4] argues that developing hate speech identification for Afaan Oromo social media is critical to eliminating the risk of hate speech to social welfare. The authors ran six experiments using ML approaches such as support vector machine (SVM), multinomial Naive Bayes (MNB), linear support vector machine (LSVM), logistic regression (LR), and random forest (RF) classifiers to build hate speech detection prototypes for Facebook and Twitter.
Despite the fact that they constructed the Afaan Oromo hate speech detection model
using ML methods and data from the Facebook and Twitter networks, the study only
looked at posts and comments in textual documents. Posts and comments including
images or photos, audio, or video data have not been evaluated. To evaluate performance, the researchers used metrics such as accuracy, precision, recall, and F1-score. Bigram and term frequency-inverse document frequency (TF-IDF)
ML feature selection techniques were utilized. The results show that LSVM has the best
performance. As a result, the researchers agreed to implement the Afaan Oromo hate
speech detection model using LSVM. The most significant limitation of this work is its use of traditional ML algorithms that require manual labeling of the dataset, and the data experiments were small in scale. According to those researchers, going beyond traditional ML methodologies could be the next study. As a result, we intend to use deep learning algorithms in this study.
Another author[9] created a number of models to detect and categorize Afaan Oromo hate speech on social media by combining several machine learning algorithms with feature extraction techniques such as BOW, TF-IDF, word2vec, and Keras embedding layers. They collected a total of 12,812 posts and comments from Facebook, focusing on four thematic categories of hate speech: gender, religion, race, and offensive speech. The author generalizes the work as follows: Bi-LSTM with pre-trained word2vec feature extraction is the superior approach, with accuracy scores of 0.84 and 0.88 for eight and two classes, respectively.
The author of [8] conducted a comparison of deep learning-based Afaan Oromo hate
speech detection methods. They begin by extracting a corpus of Facebook and Twitter
comments and posts using word n-grams and word embedding methods such as
Word2Vec. There were 35,200 posts and comments in all. The entire dataset size was
expanded to 42,100 by using the data augmentation approach. Convolutional neural
networks (CNN), long short-term memory networks (LSTM), bidirectional long short-term memory networks (Bi-LSTM), GRU, and CNN-LSTM are then employed for hate
speech detection. The experiment results show that the model developed with CNN and
Bi-LSTM has the greatest weighted average F1-score of 87%. The author suggests that
future research look into the performance of classifier ensembles and meta-learning for
this task. Furthermore, the performance is not perfect, which means that users may
encounter misclassified content. As a result, we used classifier ensembles and meta-
learning to boost performance.
The author of [40] developed an RNN-based algorithm for automatically detecting
hateful Amharic posts and comments. The author used a dataset of 30,000 items, with
80% for training, 10% for validation, and 10% for testing, and employed two deep
learning algorithms (LSTM and GRU) with word n-gram and word2vec features. The
LSTM algorithm with word2vec feature extraction obtained the greatest accuracy,
97.9%. However, the work performs only binary classification, which is insufficient, and
employs only a few algorithms (LSTM and GRU).
A study introduced in [5] developed an Apache Spark-based model to classify Amharic-
language Facebook posts and comments into hate and not hate. They employed Random
Forest and Naive Bayes as learning algorithms and Word2Vec and TF-IDF for feature
extraction, using 6,120 items (4,882 for training and 1,238 for testing). In their
experiments, Naive Bayes and Random Forest achieved accuracies of 79.83% and
65.34%, respectively, with the word2vec feature vector modeling approach. They
recommended expanding the classification categories with different aspects of hate and
increasing the corpus size with other sources.
The authors of [46] studied online hate campaigns on an Italian social network, utilizing
data collected from public Facebook pages in Italy. The collected dataset was categorized
into three classes: "no hate," "weak hate," and "strong hate." To create a second dataset,
the "weak hate" and "strong hate" classes were merged into a single "hate" class. The
authors developed and implemented two classifiers, SVM and LSTM, specifically
designed for the Italian language, incorporating morpho-syntactical features, sentiment
polarity, and word embedding lexicons. Two separate experiments were conducted using
both datasets, ensuring at least 70% agreement among annotators on the data's
classification. The SVM classifier achieved an F1-score of 80% for binary classification,
while the LSTM achieved a slightly lower F1-score of 79%. For ternary classification,
SVM and LSTM achieved F1-scores of 64% and 60%, respectively. The study utilized
only conventional SVM and LSTM models; employing additional deep learning models
could potentially enhance classification performance.
The authors of [47] proposed cyber hate speech detection for the Arabic context on the
Twitter platform, applying NLP and machine learning techniques. The work focused on a
set of tweets related to sports, racism, terrorism, journalism, sports orientation, and
Islam. The processed dataset was used in experiments with Decision Tree (DT), Support
Vector Machine (SVM), Naive Bayes (NB), and Random Forest (RF). In their
experiments, RF with TF-IDF and profile-related features achieved the best result, with
an accuracy of 91.3%. Since hate as a term is subjective and can be expressed in a wide
range of areas, not restricted to sports, religious, or racial issues, they recommended
further work on a more generalized dataset and effective detection models.
Likewise, a deep learning approach for automatic cyber hate speech detection on Twitter
is presented in [47]. The dataset was collected from Twitter and captures different hate
expressions in the Arabic region. The authors used word embedding mechanisms for
feature extraction, and a hybrid of CNN and LSTM networks was implemented for
model development. The proposed approach, which classifies tweets as hate or normal,
achieved promising results: 66.564%, 79.768%, 65.094%, and 71.688% for accuracy,
recall, precision, and F1 measure, respectively. The study is limited to binary
classification, whereas it is also important to distinguish offensive expressions from hate
speech; the study further recommended a more standardized dataset and higher-
performance deep learning approaches.
Gambäck et al. [48] present a deep learning-based hate speech classification system for
Twitter using a dataset prepared by Benikova et al. [69] with four class categories:
sexism, racism, both (sexism and racism), and not hate speech. They used four feature
embeddings, namely word2vec, character n-grams, random vectors, and character
n-grams combined with word vectors, together with a deep learning CNN. The models
were tested using 10-fold cross-validation; their best-performing model used word2vec
embeddings, yielding the highest precision and recall with a 78.3% F-score.
Several studies have demonstrated the effectiveness of ensemble deep learning in hate
speech detection. For instance, Smith et al. (2020) developed an ensemble model by
combining the predictions of CNNs, LSTMs, and BERT models, achieving superior
performance compared to individual models. The ensemble model effectively captured
both local and contextual features, enhancing its ability to detect subtle nuances of hate
speech.
Furthermore, Gupta and Rajput (2021) proposed a stacked ensemble model that
combined predictions from multiple CNNs and LSTMs. The model leveraged the
complementary strengths of the base models, leading to improved detection accuracy and
robustness across diverse hate speech datasets.
Table 3: Summary of related work

(continued) ... not for multi-purpose. The performance of the model is not good enough;
the author did only binary classification.

[40] Approach: deep learning (LSTM and GRU); Features: n-gram and word2vec;
Dataset: 30,000 items; Language: Amharic; Contribution: proposed a hate speech
detection model for the Amharic language; Limitation: limited to hate speech in the
Amharic language and to binary classification; only LSTM and GRU models were used.

[48] Approach: deep learning (CNN); Features: word2vec; Dataset: 6,655 items in total;
Language: English; Contribution: applied CNN to the problem of multi-class
classification (racism, sexism, both racism and sexism, and neither); Limitation: the
study used a small dataset of 6,655 items, which was biased and needed class balancing.

[20] Approach: machine learning (Random Forest and Naïve Bayes); Features:
Word2Vec and TF-IDF; Dataset: 6,120 posts; Language: Amharic; Contribution:
proposed applying Apache Spark to hate speech detection to reduce the challenges;
Limitation: limited to hate speech detection in Amharic.

[2] Approach: machine learning (RFDT, NB, SVM); Features: word n-gram, character
n-gram, orthography, and lexicon; Dataset: 16,500 posts; Language: Indonesian;
Contribution: proposed a model that detects the target, category, and level of hate
speech; Limitation: only conventional machine learning algorithms were applied.
Chapter 3: Methodology
3.1 Introduction
This chapter describes the methodologies used to accomplish this research, including
methods to implement the models, the literature review, data collection, data
preparation, the software and hardware configuration of the system used, and the
techniques used to evaluate the models.
Diagram
3.3. Methodology
Related literature must be reviewed to understand hate speech detection systems. To
obtain resources, we performed the following tasks. We reviewed relevant local and
international journal articles, conference papers, books, and resources on the Internet
related to hate speech detection based on textual data as well as machine learning
techniques to gain a conceptual understanding and identify research gaps in the study.
The Afaan Oromo text dataset for hate speech detection, which is the main focus of the
analysis in this paper, was retrieved from comments and posts published on Facebook,
Twitter, and YouTube from January 2023 to May 2023.
This work targets Facebook pages, Twitter accounts, and YouTube channels that are
open to suspected hate speech, rather than websites or blogs that already have specific
agendas. In Ethiopia, it is common for social network communities to post on political
and religious issues. While many users use different languages to create or share
information, only Afaan Oromo data is considered here: all posts and comments
collected from the different pages had to be in Afaan Oromo. In the data collection
process, posts and comments were collected using Facepager, an application built on the
Facebook API that can download posts and comments from a desired page in CSV
format. The dataset covers the following hate speech categories:
Free speech
Race
Religious
Politics
Offensive
3.2. Data source
The data sources for building the Afaan Oromo hate speech and offensive language
dataset are Afaan Oromo social media pages with many followers. Pages were randomly
selected from among those with higher numbers of followers, because a page with a
higher number of followers is believed to attract different users with different points of
view, as well as from well-known Afaan Oromo media. Several sampling metrics or
criteria can be used to select pages or users on social media platforms. This study chose
the following criteria for selecting a public page:
A page that posts news or comments daily on issues of religion, ethnicity, politics, and
gender, or that calls for violence.
The number of likes and followers of a page must be greater than 10,000, which selects
for more active public pages.
A page that uses the Afaan Oromo language most frequently for posts and comments.
Depending on the above page selection criteria, we collected data from different public
pages.
The summary of pages that were utilized to build the corpus is provided in Table 1.
Those pages listed in Table 1 typically post discussions on political, social, economic,
religious, and environmental issues that took place in Ethiopia. In total, 35,000 posts and
comments were collected. To remove the noise from the data set, rigorous preprocessing
was carried out, which resulted in the removal of HTML, URLs, tags, emoticons, and
other language scripts.
Based on these criteria, the selected pages and the number of data collected from each
page are given in the table below:
No.  Page                   Posts and comments collected
4    Taye Dendea Aredo      6,200
5    BBC Afaan Oromo        4,560
6    Oromia Media Network   6,000
7    Jawar Mohammed         4,600
8    VOA Afaan Oromoo       4,500
Total number of data filtered: 37,244
Total number of unique data after removing redundancy: 35,000
The annotation task for the multi-label hate speech dataset involved assigning multiple
labels to each data instance from a set of five classes: free speech, religion, race, politics,
and offensive content. Four annotators, including the researcher, completed this task.
Each annotator was responsible for reviewing and annotating the texts, considering all
relevant classes that applied to each instance. The annotators were selected based on
their background knowledge and expertise in analyzing hate speech. By leveraging the
combined efforts of these four annotators, the dataset was comprehensively labeled with
appropriate class labels, allowing for a more nuanced understanding of hate speech
across multiple dimensions. The annotation task was performed following the provided
annotation guidelines, which served as a set of instructions and criteria determining how
each instance in the dataset should be labeled. To ensure consistency, each annotator
was provided with the following guidelines (rules). A post is marked as politics-, race-,
or religion-related speech as follows:
If at least one of the criteria (1-5) is met, the annotators classify the sentence (speech)
into the politics, religion, or race hate class.
If the above criteria are fulfilled, the instance is labeled with the corresponding (politics,
religion, or race) hate speech class; otherwise, it is labeled as neutral speech.
Offensive language class: if at least one of the following criteria (1-4) is met, the
annotators classify the sentence (speech) as offensive-class hate speech:
If the post or comment contains a common insult without falling under the other three
classes above.
If the post or comment contains an insult, whether or not it promotes a violent attack on
an individual or group.
If the post or comment contains an upsetting word but the particular subject of the
sentence is unknown (the sentence has a hidden subject/noun/pronoun).
Following the above guidelines, we assigned multiple labels to each data instance from
the set of five classes. Detailed statistics of the balanced Afaan Oromo dataset,
categorized into five classes (Free, Race, Politics, Religion, and Offensive Language),
are provided in Table 2.
start with text normalization. Text normalization includes:
Convert text to lowercase: lowercasing all Afan Oromo text data, although
commonly overlooked, is one of the easiest and most useful forms of text
preprocessing. It is applicable to most text mining and NLP problems and can
help in cases where the corpus is not very large, meaningfully improving the
consistency of the output. For example, when training a word embedding model
for similarity lookups, different variations in input capitalization (e.g.,
'Paartii' vs. 'paartii') can give different outputs or no output at all. This
probably happens because the dataset has mixed-case occurrences of the word
'Paartii' and there is insufficient evidence for the neural network to
effectively learn the weights for the less common version. This type of issue is
bound to occur when the dataset is fairly small, and lowercasing is a great way
to deal with such sparsity issues.
Remove numbers: numbers are removed if they are not relevant to the
analysis. Typically, regular expressions are used to remove numbers.
Remove punctuation: punctuation is also removed; it is essentially the set of
symbols [!"#$%&()*+,./:;<=>?@[\]^_{|}~].
Remove HTML tags: since our dataset is web-scraped, there is a chance that it
contains some HTML tags. Since these tags are not useful for our natural
language processing tasks, it is better to remove them.
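The four normalization steps above can be sketched as one cleaning function; the regex patterns and the example sentence are illustrative, not taken from the actual pipeline:

```python
import re

def clean_text(text):
    """Apply the normalization steps described above to one post/comment."""
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags
    text = text.lower()                                   # lowercase ('Paartii' -> 'paartii')
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    text = re.sub(r'[!"#$%&()*+,./:;<=>?@\[\]^_{|}~]',    # remove punctuation
                  " ", text)
    text = re.sub(r"\s+", " ", text).strip()              # collapse whitespace
    return text

print(clean_text("<p>Paartii 123, gaarii!</p>"))  # -> paartii gaarii
```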
Tokenization
To get a computer to understand Afaan Oromo text, we need to break the
text down in a way that a machine can understand; that is where the concept
of tokenization in Natural Language Processing comes in. As tokens are the
building blocks of natural language, the most common way of processing raw
text happens at the token level. For example, Transformer-based models, the
state-of-the-art neural network architectures in Natural Language Processing
(NLP), process raw text at the token level, and the most popular neural
network architectures for NLP, such as RNN, GRU, and LSTM, do likewise.
Tokenization can be broadly classified into three types: word, character, and
subword (character n-gram) tokenization [27]. In our work, word-level
tokenization is used.
Word tokenization: this is the most commonly used tokenization algorithm.
It splits a given text into individual words based on a certain delimiter;
depending on the delimiter, different Afaan Oromo word-level tokens are
formed. Pre-trained word embeddings such as Word2Vec and GloVe use this
type of tokenization.
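A minimal whitespace word tokenizer of the kind described (the example sentence is illustrative):

```python
def word_tokenize(text, delimiter=None):
    """Word-level tokenization: split text on a delimiter (whitespace by default)."""
    return text.split(delimiter)

print(word_tokenize("jechoota Afaan Oromoo"))  # ['jechoota', 'Afaan', 'Oromoo']
```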
3.5. Splitting Dataset
For the experiment, the dataset is divided into two sets, i.e., Training Set and Testing
Set. In this research, we split our dataset into Training Set and Testing Set with a ratio
of 80:20, respectively. The training set is used to train and optimize models. The
testing set (unseen set) is used to evaluate models.
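The 80:20 split described above can be sketched with scikit-learn's train_test_split; the data here is a toy placeholder standing in for the vectorized posts and their labels:

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels (the real inputs are the vectorized posts).
X = list(range(100))
y = [i % 2 for i in range(100)]

# 80:20 split; random_state fixes the shuffle for reproducibility,
# stratify keeps the label distribution the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

print(len(X_train), len(X_test))  # 80 20
```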
(1/T) * Σ_{t=1}^{T} log p(w_t | w_c)                                  (1)
where w_t in Eq. 1 is the target word, and w_c represents the sequence of context
words. The Word2Vec model can be implemented in two ways: (1) pre-training and
using it as an input layer at the beginning of the model architecture, or (2) training it
with the model itself.
data processing tools, because Python is the language of choice for developers,
researchers, and data scientists.
Implementation tools
Tools Version Description
Anaconda    1.9.12   A desktop graphical user interface (GUI) included in the
Navigator            Anaconda distribution that allows us to manage conda
                     packages, environments, and channels and to launch
                     applications without command-line commands.
Jupyter     6.0.3    An open-source web application for creating and sharing
Notebook             documents that contain live code, equations, visualizations,
                     and narrative text. Uses include numerical simulation, data
                     visualization, statistical modeling, data cleaning, and
                     transformation.
Python      3.7.0    A general-purpose programming language suitable for
                     implementing deep learning algorithms.
Notepad++   7.8.6    A source code and text editor for Microsoft Windows that
                     supports tabbed editing. We used it for data preparation and
                     for managing the annotators' data annotation process.
PyTorch     1.4.0    An open-source machine learning library providing optimized
                     tensor computation on CPU and GPU, used for developing and
                     training neural network-based deep learning models. We used
                     it to train our RNN-based hate speech detection model.
NumPy       1.18.1   Array processing for numbers, strings, and objects. We used
                     it to handle our dataset features for training and testing
                     the models.
Matplotlib  3.1.3    Produces publication-quality figures in Python. We used it
                     for data visualization.
Deployment Environments
The tools used for implementation were installed on a personal computer (DESKTOP-
JPLMC78) equipped with an Intel(R) Core(TM) i3-4005U CPU @ 1.70 GHz, 6.00 GB
of RAM, and a 64-bit operating system on an x64-based processor.
Chapter 4: Results and Discussion
In this chapter, the results of each proposed model are presented. The data source,
data preprocessing, and data labeling are discussed first; after that, the experiments
and evaluation of each model are discussed.
This corpus contains Afan Oromo sentences with appropriate collections of hate
speech from different sources. The diagram provides information about the data: the
total number of sentences and words.
Normalization
Normalization is the process of transforming text into a standard format that can be
easily analyzed and compared. In natural language processing, normalization is often
used to correct misspelled words, remove slang words, or standardize the
representation of words with different spellings but the same meaning.
In the case of homophones like "baayee" and "baayyee", normalization can be used to
represent both words in a consistent and standardized way. This can improve the
accuracy of natural language processing models that rely on text data by reducing the
number of unique representations of the same word.
Additionally, normalization can help to improve the readability and understandability
of text by removing noise and irrelevant information.
Overall, normalization is an important step in preparing text data for analysis and can
help to improve the accuracy and effectiveness of natural language processing models.
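As a sketch, variant normalization can be implemented as a lookup table mapping spelling variants to one canonical form; the mapping below contains only the "baayee"/"baayyee" example from the text, and the choice of canonical spelling is an assumption:

```python
# Hypothetical variant-to-canonical mapping; the actual list used in this
# work is not reproduced here.
VARIANT_MAP = {"baayee": "baayyee"}

def normalize_tokens(tokens):
    """Replace each known spelling variant with its canonical form."""
    return [VARIANT_MAP.get(t, t) for t in tokens]

print(normalize_tokens(["baayee", "gaarii"]))  # ['baayyee', 'gaarii']
```

Collapsing variants this way reduces the number of unique representations of the same word before feature extraction.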
Stop word removal
The use of stopword removal is to improve the accuracy and efficiency of natural
language processing (NLP) tasks such as text classification, sentiment analysis, and
information retrieval.
Stopwords are words that are common in a language and do not carry much meaning,
such as "akkuma", "kan", "osoo", "yoom", "fi", "naaf", "hanga", etc. These words
appear frequently in text and can clutter the dataset without adding much value to the
analysis.
By removing stopwords from text data, we can reduce the size of the dataset and
improve the accuracy of NLP models. This is because stopwords can distort the
frequency of important words in the dataset, making it more difficult for the model to
identify meaningful patterns. Stopword removal can also improve the efficiency of
NLP tasks by reducing the processing time required to analyze text data. This is
because removing stopwords reduces the amount of text that needs to be processed,
making the analysis faster and more efficient.
Define a list of stopwords for the Afan Oromo language.
Tokenize the input text into individual words.
For each word in the list of words:
    If the word is a stopword, remove it from the list of words.
    Otherwise, keep it in the list of words.
Join the filtered words back into a single string and return it.
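A runnable version of the pseudocode above; the stopword list here is only the illustrative subset quoted earlier in this section, and the example sentence is hypothetical:

```python
# Illustrative subset of Afan Oromo stopwords from the examples above;
# the full list used in this work is longer.
STOPWORDS = {"akkuma", "kan", "osoo", "yoom", "fi", "naaf", "hanga"}

def remove_stopwords(text):
    words = text.split()                             # tokenize into words
    kept = [w for w in words if w not in STOPWORDS]  # drop stopwords
    return " ".join(kept)                            # rejoin into a string

print(remove_stopwords("inni fi isheen hanga galgalaa hojjetan"))
# -> inni isheen galgalaa hojjetan
```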
Overall, stopword removal is an important step in preprocessing text data for NLP
tasks, as it can improve the accuracy and efficiency of the analysis.
Accuracy = (True Positive + True Negative) / (True Positive + False Positive + True
Negative + False Negative)                                                      (4.1)
Precision is the ratio of true positive cases over the total number of cases predicted
as positive. It measures how many of the predicted positive cases are actually positive
(Islam, Mercer, & Xiao, 2019). The formula for precision is:
Precision = True Positive / (True Positive + False Positive)
Recall is the ratio of true positive cases over the total number of actual positive cases.
It measures how many of the actual positive cases were identified as positive by the
model (Islam, Mercer, & Xiao, 2019). The formula for recall is:
Recall = True Positive / (True Positive + False Negative)
F-score is a measure that combines both precision and recall into a single metric: it is
the harmonic mean of precision and recall. The formula for the F-score is:
F-score = (2 × Precision × Recall) / (Precision + Recall)
The F-score is a useful metric when precision and recall are equally important and you
want to balance between them.
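The four metrics above can be computed directly from the confusion counts; the count values below are arbitrary illustrative numbers, not results from this work:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, p, r, f1 = metrics(tp=40, fp=10, tn=45, fn=5)
print(round(acc, 2), round(p, 2), round(r, 2), round(f1, 3))  # 0.85 0.8 0.89 0.842
```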
A confusion matrix tabulates true positives, false positives, true negatives, and false
negatives. The rows of the matrix represent the actual (ground truth) values, and the
columns represent the predicted values. The confusion matrix is used to calculate the
accuracy, precision, recall, and F-score. The following is an example of a confusion
matrix:
In the above Table 4-1 matrix, True Positive represents the number of cases where the
model predicted positive, and the actual value was positive. False Positive represents
the number of cases where the model predicted positive, but the actual value was
negative. False Negative represents the number of cases where the model predicted
negative, but the actual value was positive. True Negative represents the number of
cases where the model predicted negative, and the actual value was negative (Islam,
Mercer, & Xiao, 2019).
4.5.Experiment Results
After preprocessing, we convert the text data into numerical features using TF-IDF.
We split the dataset into training and testing sets using an 80/20 split ratio. We then
train an SVM multi-label classifier using OneVsRestClassifier (Pedregosa, 2011) and
evaluate its performance on the testing set using metrics such as accuracy, precision,
recall, and F1-score. We trained this support vector machine (SVM) model on our hate
speech dataset and achieved an accuracy of 78%, precision of 78%, recall of 77%, and
an F1-score of 80%.
As with the other machine learning models, we used the logistic regression classifier
from the scikit-learn library, with the following hyperparameters:
- Penalty: the type of regularization. We used L2 regularization, which adds a penalty
term to the loss function proportional to the squared magnitude of the weights.
- Solver: the optimization algorithm. We used the 'liblinear' solver, which is efficient
for small datasets and supports L1 and L2 regularization.
- Multi-class: the method for handling multi-class classification. We used the 'ovr'
(one-vs-rest) strategy, which trains a separate binary classifier for each class and
makes predictions based on the highest probability.
The logistic regression (LR) model on the same dataset achieved an accuracy of
77.7%, precision of 79.3%, recall of 76.6%, and F1-score of 81.0%.
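The hyperparameter choices above can be sketched as follows; the one-vs-rest strategy is realized here with scikit-learn's OneVsRestClassifier wrapper, and the data is a toy placeholder rather than the real TF-IDF matrix:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 2-D feature vectors with binary labels standing in for the real data.
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
y = [0, 1, 1, 0, 1, 0]

# L2 penalty with the 'liblinear' solver, wrapped one-vs-rest as described.
clf = OneVsRestClassifier(
    LogisticRegression(penalty="l2", solver="liblinear"))
clf.fit(X, y)

print(list(clf.predict(X)))
```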
After repeating the same steps as for the SVM, we trained a Random Forest classifier
on the training set using the scikit-learn library. Random Forest is a suitable algorithm
for multi-label classification tasks, as it can handle multiple labels for each sample
without the need for the MultiOutputClassifier wrapper. We trained a Random Forest
(RF) model on the dataset and achieved an accuracy of 82%, precision of 82%, recall
of 79.8%, and F1 score of 80%. The confusion matrix showed that the model had
lower precision and recall scores for the politics and race intolerance categories
compared to the SVM and LR models.
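As noted above, scikit-learn's RandomForestClassifier accepts a multi-label indicator matrix directly, so no MultiOutputClassifier wrapper is needed; the following is a toy sketch, not the actual experiment:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy features and a 2-D multi-label indicator target (two labels possible
# per sample); the real Y has one column per hate speech class.
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [2, 0], [0, 2]])
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [0, 1], [1, 0]])

rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X, Y)                  # fits one multi-output forest, no wrapper

print(rf.predict(X).shape)    # (6, 2)
```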
4.6.3. Experiment with Ensemble CNN
Parameter              Values
Convolutional layers   1
Pooling layer          GlobalMaxPooling1D
Activation functions   ReLU; softmax (at output layer)
Epochs                 20
Batch size             64
Optimizer              Adam
Training/test size     80% and 20%
Table 4- 2 Model Configuration of CNN
The input embedding dimension reflects the vocabulary size of our data, which is
53,473. The dropout rate is 0.5, meaning 50% of the neurons are randomly dropped
during training. We have one convolutional layer whose filter count, kernel size, and
activation are 128, 2, and ReLU, respectively, in our case.
The proposed CNN achieved an accuracy of 85%, with a precision of 86%, recall of
84%, and F1-score of 85%.
Figure 4- 3 Confusion Matrix of CNN model
Per-class results with word2vec features (partial table):
Class      Precision  Recall  F1-score
Offensive  0.95       0.95    0.95
Politics   0.96       0.97    0.96
We trained Word2Vec on about 60,000 words, which is likely to provide a rich and
diverse set of word embeddings capturing the semantic relationships between words in
our text corpus. These embeddings can then be used as input features for the CNN
and LR models, which learn to classify hate speech texts based on them.
Per-class results of the CNN+LR ensemble with word2vec (partial table):
Class      Precision  Recall  F1-score
Offensive  0.94       0.93    0.94
Race       0.96       0.93    0.94
The CNN model is particularly effective at learning local and global features in text
data, such as n-grams and sentence structures. The LR model, on the other hand, is a
linear model that creates decision boundaries separating different classes of text data.
By combining these models with Word2Vec embeddings, we create a more powerful
and accurate model that outperformed the individual models: the ensemble of the LR
and CNN models achieved an accuracy of 95%, with the other metrics listed in the
table above. The ensemble of CNN and LR showed very impressive performance
compared to the individual models.
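One simple way to combine two base models is soft voting: average their per-class probability vectors and take the argmax. This is an illustrative sketch with made-up probabilities; the exact combination rule used in this work is not specified above:

```python
import numpy as np

# Made-up per-class probabilities from two base models (e.g. CNN and LR)
# for two input texts over five classes.
cnn_probs = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],
                      [0.20, 0.50, 0.10, 0.10, 0.10]])
lr_probs = np.array([[0.60, 0.20, 0.10, 0.05, 0.05],
                     [0.10, 0.60, 0.10, 0.10, 0.10]])

# Soft voting: average the probabilities, then pick the most likely class.
ensemble_probs = (cnn_probs + lr_probs) / 2.0
pred = ensemble_probs.argmax(axis=1)
print(pred)  # [0 1]
```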
4.5.6. Experiment Ensemble CNN+RF
Per-class results of the CNN+RF ensemble with word2vec (partial table):
Class      Precision  Recall  F1-score
Offensive  0.95       0.95    0.95
Figure 4- 6 Confusion Matrix for Ensemble CNN+RF
The accuracy is the same as that of the CNN+LR model; however, there are some
differences in the other metrics. The ensemble CNN+RF achieved better results,
especially for the free, offensive, and race classes, than CNN+LR. In general, the
ensemble models showed improved performance compared to the individual models,
indicating that combining different machine learning models can lead to better results
in hate speech detection.
To develop a GRU (Gated Recurrent Unit) model for Afan Oromo hate speech, we
first specify the model configuration, hyperparameter tuning, and evaluation procedure
needed to achieve the best results. For the GRU, the following model configuration is
used. The input dimension of the embedding layer is 53,475, as mentioned in the table
below, and the embedding is pre-trained word2vec with about 60,000 word embeddings.
Parameter Values
Embedding dimension Input dimension = 53475, output dimension=300
Dropout rate 0.4
Number of GRU layers 3
Epochs 30
Batch size 64
Optimizer Adam
Training/Test size 80 % and 20 %
Table 4- 7 Model Configuration (parameter setting) of GRU
The embedding layer takes the vocabulary size, the weights from the pre-trained
word2vec embedding matrix, the output dimension, and the maximum length of input
sequences in our dataset. The GRU consists of 3 layers with a dropout rate of 0.4 to
prevent overfitting. The last dense layer consists of 5 units, corresponding to the
number of classes in our problem, with a softmax activation function to generate
probabilities for each class. The model is trained using the categorical_crossentropy
loss function, optimized with the Adam optimizer, and evaluated based on accuracy.
The proposed GRU achieved an accuracy of 85%, with a precision of 86%, recall of
84%, and F1-score of 85%.
Figure 4- 7 Confusion Matrix of GRU Model
The combination of GRU and SVM achieved an accuracy of 90%, indicating that it
correctly predicted 90% of the instances. In terms of recall, the model achieved a
value of 0.89, which means that it correctly identified 89% of the positive instances in
the dataset; the precision and F1-score are 91% and 90%, respectively. Recall is an
important metric, especially in scenarios where correctly identifying positive instances
is crucial: a high recall value indicates that the model is effective at capturing positive
instances, reducing the chances of false negatives.
Per-class results of the GRU+SVM ensemble with word2vec (partial table):
Class      Precision  Recall  F1-score  Accuracy
Free       0.88       0.90    0.89      0.90
Offensive  0.89       0.88    0.89
The precision of the model was measured to be 0.90, indicating that out of all the
instances predicted as positive, 90% were actually positive. Precision measures the
ability of the model to avoid false positives. A high precision value signifies that the
model is proficient in correctly classifying positive instances and minimizing false
positives.
Figure 4- 8 Confusion Matrix of Ensemble GRU+SVM
These results suggest that the GRU+SVM ensemble approach is effective in hate
speech detection in Afan Oromo language on social media. The combination of the
GRU model's ability to capture sequential dependencies and the SVM model's
strength in classification contributes to the overall performance of the ensemble.
In terms of recall, the ensemble model achieved a value of 0.90, meaning that it
correctly identified 90% of the positive instances in the dataset.
The precision of the ensemble model was measured to be 0.88, indicating that out of
all the instances predicted as positive, 88% were actually positive. Precision measures
the model's ability to avoid false positives. A high precision value indicates that the
model is proficient in correctly classifying positive instances and minimizing false
positives.
Per-class results of the GRU+LR ensemble with word2vec (partial table):
Class      Precision  Recall  F1-score
Offensive  0.89       0.88    0.88
Compared to GRU+SVM, the accuracy is the same; however, there are some
differences in the other metrics, such as precision, recall, and F1-score. Overall, these
results demonstrate that the GRU+LR ensemble approach is effective in detecting
hate speech in the Afan Oromo language on social media.
The model achieved an accuracy of 99%, indicating that it correctly predicted 99% of
the instances. The model achieved a recall of 0.98, meaning that it correctly identified
98% of the positive instances in the dataset. Recall is an important metric, especially
in scenarios where correctly identifying positive instances is crucial. A high recall
value indicates that the model is effective at capturing positive instances, reducing the
chances of false negatives. A precision of 0.98, indicating that out of all the instances
predicted as positive, 98% were actually positive. Precision measures the ability of the
model to avoid false positives.
Per-class results of the GRU+RF ensemble with word2vec (partial table):
Class  Precision  Recall  F1-score  Accuracy
Free   0.96       0.98    0.97      0.99
The ensemble of GRU+RF outperforms all the other models on all performance
metrics.
The detailed configuration of the BiLSTM is depicted in the table below. The first
layer is the embedding layer, whose input dimension of 53,475 indicates the vocabulary
size of our dataset. We have a pre-trained Word2Vec model that provides the word
embeddings: each word in the vocabulary is assigned a unique vector representation,
and the weights of these vectors are learned by predicting the context words given a
target word, or vice versa. The learned weights capture the semantic relationships
between words based on their co-occurrence patterns in the training data.
Parameters Values
Embedding dimension Input dimension=53475, output
dimension=300
Dropout rate 0.5 (recurrent dropout = 0.3)
Memory unit 200
Epochs 30
Batch size 64
Optimizer Adam
Training/Test size 80 % and 20 %
Table 4- 12 Model Configuration of BiLSTM
The output of the embedding layer is fed to a BiLSTM layer containing 200 units.
We used 50% dropout, which randomly deactivates 50% of the neurons; this technique
is used to avoid overfitting. We also applied 30% dropout to the recurrent
connections. For training, the batch size was set to 64 and the number of epochs
to 30. At the last layer, a softmax activation function is used with a dense layer
of 5 output units to predict the class of hate speech.
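As a rough sanity check of the configuration in Table 4-12, the layer sizes
implied by the table can be computed directly (the Keras-style parameter-count
formulas below are our assumption, not something stated in the text):

```python
vocab_size, embed_dim, units = 53475, 300, 200

# Embedding layer: one 300-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent
# weights, and a bias vector; a BiLSTM doubles this.
lstm_params = 4 * (embed_dim * units + units * units + units)
bilstm_params = 2 * lstm_params

print(embedding_params)  # 16042500
print(bilstm_params)     # 801600
```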
BiLSTM+word2vec   Offensive   0.86   0.86   0.84
                  Race        0.85   0.86   0.87
(table fragment)
The proposed BiLSTM achieved an accuracy of 85%, a precision of 85%, a recall of
86%, and an F1-score of 85%.
On the other hand, the individual SVM model achieved an accuracy of 78%, precision
of 78%, recall of 77%, and an F1-score of 80%. These metrics suggest that the SVM
model performed slightly lower than the BiLSTM model, but still achieved good
results.
BiLSTM+SVM+word2vec   Offensive   0.87   0.87   0.86   (table fragment)
When combining the predictions of the BiLSTM and SVM models in the ensemble,
we observed an improvement in performance. The ensemble model achieved an
accuracy of 89%, precision of 88%, recall of 87%, and an F1-score of 88%. These
results indicate that the ensemble model was able to leverage the strengths of both the
BiLSTM and SVM models to achieve better overall performance compared to the
individual models.
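The combination step can be sketched as simple soft voting: average the two
models' per-class probabilities and pick the highest. This is a generic
illustration with invented probabilities, not the exact ensembling code used in
the study:

```python
def soft_vote(p_deep, p_classic):
    """Average two models' class-probability vectors and pick the winner."""
    avg = [(a + b) / 2 for a, b in zip(p_deep, p_classic)]
    return max(range(len(avg)), key=lambda i: avg[i]), avg

# Hypothetical probabilities over the 5 classes for one comment.
p_bilstm = [0.10, 0.60, 0.10, 0.10, 0.10]
p_svm    = [0.05, 0.45, 0.30, 0.10, 0.10]
label, avg = soft_vote(p_bilstm, p_svm)
print(label)  # 1
```

Averaging probabilities (rather than hard labels) lets a confident model outvote an uncertain one on a per-comment basis.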
The individual BiLSTM model achieved an accuracy of 85%, precision of 85%, recall
of 86%, and an F1-score of 85%. These metrics indicate that the BiLSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
BiLSTM+LR+word2vec   Offensive   0.87   0.86   0.86   (table fragment)
When combining the predictions of the BiLSTM and LR models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 88%, precision of 88%, recall of 87%, and an F1-score of 87%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the
BiLSTM and LR models to achieve better overall performance compared to the
individual models.
Figure 4- 12 Confusion Matrix of Ensemble BiLSTM+LR
The individual BiLSTM model achieved an accuracy of 85%, precision of 85%, recall
of 86%, and an F1-score of 85%. These metrics indicate that the BiLSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
When combining the predictions of the BiLSTM and RF models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 99%, precision of 99%, recall of 97%, and an F1-score of 98%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the
BiLSTM and RF models to achieve better overall performance compared to the
individual models.
Comparing the ensemble model to its base models, it outperformed both the
individual BiLSTM model and the individual RF model on all metrics. This suggests
that the ensemble approach was effective in improving classification accuracy,
precision, recall and F1-score compared to the baselines.
Figure 4- 13 Confusion Matrix of ensemble BiLSTM+RF
One important aspect to consider is the interpretability of the models. The BiLSTM
model captures complex patterns and dependencies in sequential data but lacks
interpretability. On the other hand, the RF model provides more interpretable decision
boundaries but may not capture complex patterns as effectively. The ensemble
leverages the strengths of both models, combining their predictive power and
potential interpretability.
To develop the LSTM model, we first define the LSTM architecture, which consists
of an embedding layer, LSTM layer(s) and a final output layer.
Parameters Values
Embedding dimension Input dimension = 53475, output dimension=300
Dropout rate 0.5
Memory unit 200 for both LSTM layers
Epochs 30
Batch size 64
Optimizer Adam
Training/Test size 80 % and 20 %
Table 4- 17 Model Configuration for LSTM
We used a softmax activation function at the output layer for the multi-label
(multi-class) problem. The dense layer has five (5) units, which represents the
number of classes. In addition, as in the other deep learning models, dropout
regularization is used to avoid overfitting; in this case we used a dropout rate
of 0.5, so 50% of the neurons are dropped.
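The softmax output layer described above turns the five logits of the dense layer
into a probability distribution over the classes; a minimal sketch (the logit
values below are invented):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                       # shift for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Five output units, one per hate speech class.
probs = softmax([2.0, 1.0, 0.5, 0.1, -1.0])
print(probs.index(max(probs)))  # 0  (the predicted class)
```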
Figure 4- 14 Confusion Matrix of LSTM
The individual LSTM model achieved an accuracy of 80%, precision of 78%, recall
of 82%, and an F1-score of 80%. These metrics indicate that the LSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
The individual SVM model achieved an accuracy of 78%, precision of 78%, recall of
77%, and an F1-score of 80%. While the SVM model's performance is slightly lower
than the LSTM model, it still achieves reasonable results.
Algorithm            Hate speech class   Precision   Recall   F1_score   Accuracy
LSTM+SVM+word2vec    Free                0.89        0.90     0.89       0.90
                     Offensive           0.89        0.87     0.88
When combining the predictions of the LSTM and SVM models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 90%, precision of 91%, recall of 89%, and an F1-score of 90%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the LSTM
and SVM models to achieve better overall performance compared to the individual
models.
The individual LSTM model achieved an accuracy of 80%, precision of 78%, recall
of 82%, and an F1-score of 80%. These metrics indicate that the LSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
Figure 4- 16 Confusion Matrix of LSTM+LR
When combining the predictions of the LSTM and LR models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 91%, precision of 91%, recall of 89%, and an F1-score of 90%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the LSTM
and LR models to achieve better overall performance compared to the individual
models.
The individual LSTM model achieved an accuracy of 80%, precision of 78%, recall
of 82%, and an F1-score of 80%. These metrics indicate that the LSTM model
performs well in accurately classifying instances, with a good balance between
precision and recall.
When combining the predictions of the LSTM and RF models in the ensemble, we
observed an improvement in performance. The ensemble model achieved an accuracy
of 99%, precision of 96%, recall of 99%, and an F1-score of 97%. These metrics
indicate that the ensemble model was able to leverage the strengths of both the
LSTM and RF models to achieve better overall performance compared to the
individual models.
Figure 4- 17 Confusion Matrix of LSTM+RF
4.6. Summary
In the study on "Ensemble Deep Learning for Multi-label Afan Oromo Hate Speech
Detection on Social Media," the following models and their combinations were
explored for hate speech detection in Afan Oromo language on social media. Here is a
summary of the models, their results, and a comparison of their performance: SVM
(Support Vector Machine): The SVM model achieved an accuracy of 78%, precision
of 78%, recall of 77%, and an F1-score of 80%. While SVM is a traditional machine
learning algorithm, its performance in capturing complex patterns and dependencies
in text data may be limited. LR (Logistic Regression): The LR model achieved an
accuracy of 77.7%, a precision of 79.3%, a recall of 76.6%, and an F1-score of
81.0%. Similar to SVM,
logistic regression is a linear classifier that may not effectively capture complex
patterns in text data. RF (Random Forest): The RF model achieved an accuracy of
84%, precision of 85%, recall of 82%, and an F1-score of 84%. While Random Forest
can handle multi-label classification, it may not capture the sequential nature of text
data as effectively as deep learning models.
The CNN model achieved an accuracy of 85%, with a precision of 86%, a recall of
84% and an F1-score of 85%. CNNs can capture local dependencies and patterns in
text, making them suitable for hate speech detection; however, they may not
effectively capture long-term dependencies. The BiLSTM model achieved an accuracy
of 86%, a precision of 85%, a recall of 86%, and an F1-score of 85%. BiLSTM
models can capture both forward and backward dependencies, making them effective
in capturing long-term patterns in text data. The individual LSTM model achieved
an accuracy of 80%, a precision of 78%, a recall of 82%, and an F1-score of 80%.
The GRU achieved an accuracy of 85%, with a precision of 86%, a recall of 84%
and an F1-score of 85%.
Chapter 5: Conclusions and Recommendations
5.1. Conclusions
The pervasive presence of social media platforms has become a powerful tool for
communication, enabling individuals to share their thoughts, opinions, and
experiences with a global audience. While this unprecedented connectivity has
brought numerous benefits, it has also given rise to a concerning issue: the
proliferation of hate speech. Hate speech, characterized by discriminatory, offensive,
or harmful content targeting individuals or groups based on attributes such as race,
religion, ethnicity or politics, poses a significant threat to social harmony, online
discourse, and ultimately, society as a whole. Detecting hate speech in social media
content is an essential task, not only for maintaining a respectful and inclusive online
environment but also for legal and ethical reasons. Moreover, it is particularly
challenging in languages with limited resources and tools, such as Afan Oromo, one
of the widely spoken languages in East Africa. Afan Oromo has gained prominence in
the digital landscape due to its use in various social media platforms, making it crucial
to develop effective hate speech detection methods tailored to this language.
This study aims to address this pressing issue by proposing an ensemble deep learning
approach for multi-label Afan Oromo hate speech detection on social media.
Ensemble learning combines multiple machine learning models to enhance predictive
performance and robustness. In this study, various models were explored for hate
speech detection in the Afan Oromo language on social media. The models
considered included support vector machine, logistic regression, random forest,
convolutional neural network, long short-term memory, Gated Recurrent Unit and
Bidirectional LSTM.
In this study, we collected about 35,000 public comments from three different
social media networks, namely Facebook, Twitter and YouTube. The collected
dataset was annotated by four annotators, including the researcher. We applied
different pre-processing techniques such as stop-word removal, normalization and
cleaning.
In order to see the differences, we first developed and tested the individual
models. After evaluating the individual models, we combined the predictions of
the traditional machine learning models with those of the deep
learning models. The SVM model achieved an accuracy of 78%, precision of 80%, recall of
76%, and an F1-score of 78%. While SVM is a traditional machine learning
algorithm, its performance in capturing complex patterns and dependencies in text
data may be limited. The LR model achieved an accuracy of 82%, precision of 81%,
recall of 84%, and an F1-score of 82%. Similar to SVM, logistic regression is a linear
classifier that may not effectively capture complex patterns in text data. The RF
model achieved an accuracy of 84%, precision of 85%, recall of 82%, and an F1-score
of 84%. While Random Forest can handle multi-label classification, it may not
capture the sequential nature of text data as effectively as deep learning models.
The CNN model achieved an accuracy of 85%, with a precision of 86%, a recall of
84% and an F1-score of 85%. CNNs can capture local dependencies and patterns in
text, making them suitable for hate speech detection; however, they may not
effectively capture long-term dependencies. BiLSTM (Bidirectional LSTM): The
BiLSTM model achieved an accuracy of 85%, a precision of 85%, a recall of 86%,
and an F1-score of 85%. BiLSTM models can capture both forward and backward
dependencies, making them effective in capturing long-term patterns in text data.
The individual LSTM model achieved an accuracy of 80%, a precision of 78%, a
recall of 82%, and an F1-score of 80%. The GRU achieved an accuracy of 85%, with
a precision of 86%, a recall of 84% and an F1-score of 85%.
All of the ensemble models achieved even better results compared to the individual
models. The ensemble combinations include CNN+SVM (95%), CNN+LR,
CNN+RF, LSTM+SVM, LSTM+LR, LSTM+RF, BiLSTM+SVM, BiLSTM+LR,
BiLSTM+RF, GRU+SVM, GRU+LR and GRU+RF. The ensemble models leverage
the strengths of different models and achieve higher accuracy, precision, recall, and
F1-scores. The accuracy of the ensemble of CNN with each of the three machine
learning models is the same, 95%, apart from differences in the precision, recall
and F1-score of specific classes. The ensembles GRU+SVM, GRU+RF and GRU+LR
achieved 90%, 99% and 90% respectively; the ensembles of GRU with SVM and with LR
have the same accuracy, but there are small differences in their precision,
recall and F1-scores. The accuracies of the ensembles BiLSTM+SVM, BiLSTM+LR and
BiLSTM+RF are 89%, 88% and 99% respectively. The last ensemble models combine
LSTM with machine learning: LSTM+SVM achieved an accuracy of 90%, LSTM+LR 91%
and LSTM+RF 99%. From these results we conclude that the ensembles of BiLSTM,
GRU and LSTM with Random Forest (RF) outperformed all the other ensemble models.
In general, comparing the individual models, the ensemble models, and the traditional
machine learning models, the ensemble models consistently outperform the individual
models and traditional machine learning algorithms. The ensemble models
successfully combine the strengths of different models and improve overall
performance in hate speech detection on social media.
5.2. Recommendations and Future Directions
The primary purpose of this study was to design and develop an automated system
for ensemble deep learning-based multi-label Afan Oromo hate speech detection on
social media. Developing a full-fledged system of this kind requires coordinated
teamwork between linguistic and computer science experts. Even though the results
of this study are promising, our research still has several limitations. Future
work and recommendations for further improving hate speech detection on social
media using ensemble deep learning models for the Afan Oromo language include:
Online Learning and Real-Time Detection: Investigating methods to adapt the
ensemble models for online learning, where the models can be continuously
updated with new data, can enable real-time hate speech detection on social
media platforms.
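One way such continuous updating could look is sketched below: a single logistic
unit updated one example at a time with stochastic gradient descent. This is a
toy illustration in plain Python with made-up features and labels, not the
study's code:

```python
import math

def sgd_step(w, x, y, lr=0.5):
    """Update weights w in place from a single (features, label) example."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))       # predicted probability of "hate"
    for i, xi in enumerate(x):
        w[i] += lr * (y - p) * xi        # gradient step on the log-loss
    return w

# Toy comment stream: [bias, feature]; a high feature value means hateful (1).
stream = [([1.0, 0.0], 0), ([1.0, 5.0], 1)] * 50
w = [0.0, 0.0]
for x, y in stream:                      # the model keeps learning as data arrives
    sgd_step(w, x, y)

predict = lambda x: sum(wi * xi for wi, xi in zip(w, x)) > 0
print(predict([1.0, 5.0]), predict([1.0, 0.0]))  # True False
```

Because each update touches only one example, the same loop can keep running on live comments without retraining from scratch.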
By addressing these future work areas and recommendations, the performance and
effectiveness of ensemble deep learning models for Afan Oromo hate speech
detection can be further enhanced, leading to more accurate and robust detection
systems on social media platforms.