
ISSN (Online) 2581-9429

IJARSCT
International Journal of Advanced Research in Science, Communication and Technology (IJARSCT)

Volume 6, Issue 1, June 2021


Impact Factor: 4.819

Building a Dataset for Detecting Fake News in Amharic Language
Tewodros Tazeze1 and Raghavendra R2
MSc IT Student, Department of Master of Science in Information Technology1
Assistant Professor, School of Computer Science & Information Technology2
JAIN (Deemed-to-be University), Bangalore, India

Abstract: The rapid growth and expansion of social media platforms has filled the gap in day-to-day information exchange. At the same time, social media has become the main arena for disseminating manipulated information widely and at an exponential rate. The fabrication of twisted information is not limited to one language, society, or domain, as the ongoing COVID-19 pandemic has made particularly clear. The creation and propagation of fabricated news creates an urgent demand for automatically classifying and detecting such distorted news articles. Manually detecting fake news is a laborious and tiresome task, and the dearth of annotated fake news datasets for automating fake news detection remains a tremendous challenge for the low-resourced Amharic language (the second most widely spoken Semitic language, after Arabic). In this study, an Amharic fake news dataset is crafted from verified news sources and various social media pages, and six machine learning classifiers are built: Naïve Bayes, SVM, Logistic Regression, SGD, Random Forest, and the Passive Aggressive Classifier. The experimental results show that Naïve Bayes and the Passive Aggressive Classifier surpass the remaining models with accuracy above 96% and an F1-score of 99%. The study makes a significant contribution to turning down the rate of disinformation in a vernacular language.
Keywords: Amharic, Fake News, Machine Learning, Natural Language Processing

I. INTRODUCTION
Fake news refers to news stories that are false or nonfactual: the story is manipulated, with no authenticated facts, sources, or quotes. Sometimes these stories are intended to brainwash or deceive the reader, and sometimes they are designed as clickbait. Clickbait is written mainly for economic advantage, boosting the news headline to draw more clicks from readers. The anti-social behavior of social media users can be broadly categorized into two main groups [1]. The first group comprises spreaders of misinformation, which can take many forms, such as hoaxes and the dissemination of fake news on the internet. The second group is known for reacting to specific users, which can include discussion manipulation, cyberbullying, or other similar behavior. Both types of anti-social behavior pose a serious problem because their consequences can be severe in the real world.
Amharic belongs to the Semitic language family and is the world's second most spoken Semitic language, after Arabic. The federal government of Ethiopia uses Amharic as its working language; it is also the working language (lingua franca) of several regional states and cities within the federation [2]. The huge uptake of social media and the rapid increase in social media users in the country have turned it into a venue for the spread of fabricated information. The dissemination of unverified or false information on social platforms, especially Facebook, sometimes serves as a potential spark for ethnic and religious clashes. The common method of checking whether news is fake or real is the traditional, manual approach of verifying it against different legitimate sources, and the time and effort required to verify information manually is high.
A well-known Amharic fact-checking channel called Ethiopia Check conducted a poll of 54,400 participants on its Telegram channel about social media users' experience in dealing with misinformation. The results revealed that 58% of readers decided to further check the truthfulness of news against other verified news sources, while another 38% of respondents simply consumed the news without any further verification. Another poll conducted by

Copyright to IJARSCT DOI: 10.48175/IJARSCT-1362 76


www.ijarsct.co.in

the same fact-checking channel asked 2,209 respondents about the main contributors to fabricated misinformation. Most respondents believed that so-called activists take the lion's share by producing unverified news, followed in order by bloggers, broadcast media and journalists, government organizations, and opposition bodies.
In this research work, we propose to build an Amharic fake news dataset from various news sources. The major contributions of this research are:
1. A fake news dataset is constructed in the Amharic language.
2. Different machine learning models are proposed to detect Amharic fake news.
The rest of this paper is organized as follows. Section two discusses existing approaches to building fake news datasets and classifiers in different languages. Section three presents details of the framework and the proposed system architecture. Section four presents the experimental results and evaluation metrics of the study. Possible future extensions, contributions, and the conclusion of the work are presented in section five.

II. LITERATURE REVIEW


Many researchers have conducted studies and proposed methodologies to automatically detect fake news in recent years. In this section, fake news detection datasets and methodologies are presented. One of the serious challenges in machine-learning-based approaches generally, and in automatic fake news classification specifically, is preparing a large, rich, and consistently annotated fake news dataset on which a model can be trained and tested. [3] chose a data collection process in which each story has an underlying article published on the web, and each such story is independently verified. That research relies on professional non-profit journalistic fact-checking organizations to verify the veracity of each story when classifying articles as real or fake.
Fake news datasets have been constructed for Bengali [4], English [5], and Urdu [6]. [4] collected 8.5k news items, mainly misleading or false-context articles, clickbait, and satire or parody articles, from trusted news sources in Bangladesh. Research conducted for the under-resourced Urdu language [6] used a fake news dataset originally in Urdu from different domains, a fake news dataset machine-translated from English, and a combination of the two to train the model. Another study [5] used 13,000 labelled fake news articles from the public domain in multiple subjects. Whereas most recent research has employed fake news datasets in text format only, [7] crafted a multimodal dataset called Fakeddit. This English dataset contains image and text data, comment data, metadata, and a fine-grained fake news categorization. The research foresees that these additional multimodal features will be useful for tracking a user's credibility through metadata and comment data. An annotated news dataset from a Kaggle competition was used by [21]; that dataset included 20,386 articles from the political news domain in total.
To identify fundamental theories, detection methods, and opportunities in fake news research, [8] conducted a survey that evaluated detection methods from four perspectives: the knowledge carried by fake news, its writing style, its propagation pattern, and its source credibility. The survey thoroughly reviews and assesses current fake news research, defining fake news and distinguishing it from deceptive news, nonfactual news, satire news, misinformation, disinformation, clickbait, cherry-picking, and rumors on the basis of three characteristics: authenticity, intention, and being news.
The majority of existing efforts to detect fake news propose features that exploit the information present in a particular dataset. Research in [9], [11], and [12] employed machine learning classifiers along with other text features. [11] evaluated textual features proposed by recent researchers and grouped the reviewed features into five categories: language features (bag of words, n-grams, POS), lexical features (character- and word-level features), psycholinguistic features (Linguistic Inquiry and Word Count), semantic features, and subjectivity (TextBlob). The discriminative power of these features was examined using a variety of classic and cutting-edge classifiers, such as Random Forests (RF), XGBoost (XGB), k-Nearest Neighbors (KNN), Support Vector Machine with RBF kernel (SVM), and Naive Bayes (NB). According to the experimental results, the proposed features combined with existing classifiers have a useful degree of discriminative power for detecting fake news: nearly all fake news in the data is correctly detected, while 40% of true news is misclassified.

To assess the accuracy of two credibility-focused Twitter datasets, PHEME (a dataset of potential rumors on Twitter and journalistic assessments of their accuracy) and CREDBANK (a crowd-sourced dataset of accuracy assessments for events on Twitter), [13] created an automatic fake news detection method for Twitter. That paper shows how an automated system can detect fake news in popular Twitter threads. Moreover, the study concludes that employing non-professional, crowd-sourced volunteers rather than expert journalists provides a useful and inexpensive way to rapidly classify real and fake stories on Twitter.
Many helpful methods for fake news detection have been developed in recent studies, including sequential neural networks that encode news content and social-context information, where the text sequence is analyzed in a unidirectional manner. In contrast, [14] uses a bidirectional training approach for detecting fake news, based on the BERT (Bidirectional Encoder Representations from Transformers) deep learning model. BERT is used as a sentence encoder that can accurately obtain the context representation of a sentence.
Recently, blockchain has joined the battle against hoax news. Apart from voting, résumé verification, supply chains, and many other areas, blockchain solutions are being applied to fight misinformation [15]. Some blockchain-enabled fake news detection efforts are Truepic [16], the News Provenance Project [17], led by The New York Times researchers, and Voice [18].
The target language of this study is Amharic. Amharic is a morphologically rich language with its own script, called Fidel. It is related to Arabic and Hebrew, and it is the second most widely spoken Semitic language after Arabic, with 22 million native speakers. [19] conducted research on Amharic fake news detection using deep learning and news content, and developed several computational linguistic resources for this under-resourced language. However, the unavailability of an online labelled fake news dataset for the language makes it hard to conduct research and provide convenient solutions. The dire need to fight misinformation in a vernacular language drives the authors of this research to build a fake news dataset and classification model from scratch.

III. METHOD
The availability and quality of annotated datasets is the most significant challenge for automated Amharic fake news detection. The scarcity of manually labeled fake news datasets is undoubtedly a barrier to the advancement of computationally intensive, text-based models that cover a wide range of topics. The existing Fake News Challenge dataset does not fit our needs because it contains ground truth about the relationships between texts, but not about whether those texts are true or false statements.

Figure 1: Proposed fake news detection Architecture


3.1 News Collection


We sourced our dataset from news websites and Facebook, where users can post submissions on various news topics. Facebook empowers more than 3 billion people around the world to share their ideas; more than 100 billion messages and 1 billion stories are shared every day. In this research, Facebook serves as the main source of both fake and real news; apart from Facebook, some news websites were scraped to obtain legitimate news. The following table shows the statistics of the news collected.

3.2 News Labeling


After collecting news from various sources, the next step is labeling authentic and unauthentic news. We selected the four most popular mainstream trusted news portals and various Facebook pages that post news articles in Amharic. For labeling news as fake or real, we adopted the strategies shown below from [4]:
 News that contains untrustworthy information or includes claims that can deceive the reader.
 News that employs subtle headlines to pique readers' interest and drive click-throughs to the publisher's website.
 News whose content is accurate in accordance with fact.
 News that misuses data.
 News that is slanted and biased.

3.3 Pre-processing
Like other languages, Amharic has its own punctuation marks, which separate texts or sentences into a stream of words. Amharic punctuation marks include 'hulet netb' or colon (፡), 'arat netb' (።), 'netela serez' (፣), and 'derib serez' (፤); together with the question mark '?' and the exclamation marks '!' and '¡', these are used as sentence delimiters or as white space [2]. In this work, we take all the punctuation marks as word delimiters, but 'arat netb', 'hulet netb', and white space are the ones most widely applied to tokenize words efficiently.
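As a concrete sketch of this tokenization rule, the Amharic delimiters above can be combined into one regular expression. This is a minimal illustration, not the authors' actual pre-processing code:

```python
import re

# Amharic delimiters named in the text: hulet netb (፡), arat netb (።),
# netela serez (፣), derib serez (፤), plus '?', '!', '¡' and whitespace.
AMHARIC_DELIMS = re.compile(r"[፡።፣፤?!¡\s]+")

def tokenize(text: str) -> list:
    """Split Amharic text into word tokens on punctuation and whitespace."""
    return [tok for tok in AMHARIC_DELIMS.split(text) if tok]

print(tokenize("ሰበር፡ዜና።"))  # → ['ሰበር', 'ዜና']
```

The same expression can be reused to strip the delimiters from the text before feature extraction.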

3.4 Feature Extraction


In this study, TF-IDF, CountVectorizer, and n-grams are used to extract features from the news text. TF-IDF is a weighting metric that determines the significance of a word in a document [20] and is one of the most widely used methods in information retrieval and text mining. CountVectorizer and the TF-IDF vectorizer both convert the news documents into a matrix in which each unique word is represented by a column and each text sample from the document is represented by a row. The value of each cell is simply the count of that word in the text sample. An n-gram is any contiguous sequence of n tokens (words). In this research, the most frequent bigram and trigram words are extracted from the news corpus, and relevant features and patterns of the Amharic news are studied.

IV. EXPERIMENTAL RESULT


4.1 Dataset
News articles were scraped using the Selenium WebDriver and the facebook-scraper library for Python. We collected news from both well-known news websites and Facebook pages that publish misleading as well as factual news in Amharic. While collecting misleading news from Facebook, we found that most of the pages carried exactly the same news items; after scraping these sites, we therefore removed the repeated news. For each news sample we provide one label, allowing us to train the classification model for fake news detection at a high or fine-grained level. The labeling was performed manually, which is tiresome and backbreaking because cross-checking the authenticity of each news item takes considerable time. A sample of labeled fake news is presented in Figure 2.


Figure 2: Sample fake labeled news


The above news piece reports an "assassination attempt" on the current (May 2021) prime minister of Ethiopia. The headline and the main article are fully consistent with each other, but both contain non-factual information, so the news is labeled as fake (1). By strictly applying the labeling guideline in section three, around 7,547 news items were examined; of these, 961 were found to be real news and 457 fake news.

Figure 3: Most frequent words of unprocessed fake news


As can be observed from the word clouds generated for both unprocessed fake and real news in Figure 3, YouTube, Facebook, and website links occur very frequently in the corpus. To remove such unwanted links, emojis, and punctuation, we developed a pre-processing unit in Python. The main function of the unit is to remove or replace the tokens that appear most often in the corpus, using Python regular expressions (regex). In this research, 129,850 tokens were created from the main news articles; these serve as input for the subsequent analysis. Words that contribute little to the meaning of an entire news article are treated as stop words and removed. Frequently occurring stop words, such as the breaking-news marker "ሰበር ዜና", are used to boost the news content and increase readers' interest.
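A minimal version of such a cleaning unit, covering only link removal and whitespace cleanup, might look like the following; the URL in the example is a hypothetical placeholder:

```python
import re

# Matches http(s) links and bare www. links, as seen in the word clouds.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

def clean(text: str) -> str:
    """Strip URLs from a news item and collapse leftover whitespace."""
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())

print(clean("ሰበር ዜና https://youtu.be/xyz ይመልከቱ"))  # → 'ሰበር ዜና ይመልከቱ'
```

The full unit would add further expressions for emojis and punctuation in the same style.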

4.2 Tracking N-Gram Words


We wanted to learn what patterns the model exploits to classify fake and real news with such high accuracy. Text mining relies on n-grams, which are co-occurring or contiguous sequences of n items from a large text or sentence sequence; the items can be words, letters, or syllables. In this research, bigrams and trigrams are extracted from the news articles, and the most frequently occurring ones are examined.

According to the results, the most frequent bigram is "ሰበር ዜና", meaning "Breaking News", and the most frequent trigram is "ጠቅላይ ሚኒስትር ዐቢይ", meaning "Prime Minister Abiye".
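Frequent bigrams and trigrams can be counted with the standard library alone; the two-document toy corpus below is invented for illustration:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = [
    "ሰበር ዜና ጠቅላይ ሚኒስትር ዐቢይ".split(),
    "ሰበር ዜና ዛሬ".split(),
]
bigrams = Counter(bg for doc in corpus for bg in ngrams(doc, 2))
print(bigrams.most_common(1))  # → [(('ሰበር', 'ዜና'), 2)]
```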

4.3 Text Vectorizer


Using CountVectorizer creates vectors with the same dimensionality as our vocabulary: if a text sample contains a vocabulary word, a one is placed in that dimension; the count is incremented every time the word appears again, and zeros are left where the word never occurs. As a result, a very large vector of dimensions 950 by 118,947 is produced. The second method applied to transform the news text is TfidfVectorizer. To remove words that appear too frequently, we set the threshold at 70%: words that occur in more than 70% of documents are discarded from the vocabulary. Finally, vectors of dimensions 950 by 18,947 are generated to train the classification model.
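The 70% document-frequency cut-off can be sketched in plain Python; scikit-learn's TfidfVectorizer applies the same rule through its max_df parameter. The short documents below are invented:

```python
from collections import Counter

def vocabulary(docs, max_df=0.70):
    """Build a vocabulary, discarding terms whose document frequency
    (share of documents containing the term) exceeds max_df."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, c in df.items() if c / n_docs <= max_df)

docs = ["ሰበር ዜና ሀ", "ሰበር ዜና ለ", "ሰበር ዜና ሐ", "መልካም ቀን ሀ"]
print(vocabulary(docs))  # 'ሰበር' and 'ዜና' (df = 0.75) are discarded
```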

Figure 4: Most frequent bigram and trigram words

4.4 Fake News Classifier


The main purpose of building the machine learning models is to solve the classification problem of fake news. In our experimental dataset, authentic news accounts for 67.87 percent and fake news for 32.13 percent. Six different machine learning classification models are trained on the Amharic fake news dataset in this project. The classification results are reported in the table below, and the overall performance of the experiments is close. In the majority of cases we achieve nearly perfect F1, recall, and precision; however, the precision, recall, and F1-score of the fake class vary from experiment to experiment. The formulas below define Precision (1), Recall (2), F1 (3), and Accuracy (4) for fake news classification, and our model performed very well on them.

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

F1 = (2 × Precision × Recall) / (Precision + Recall)    (3)

Accuracy = (TP + TN) / TR    (4)

In the above equations, TP represents the number of correctly classified fake news items, FP the number of real news items incorrectly classified as fake, FN the number of fake news items incorrectly classified as real, TN the number of correctly classified real news items, and TR the total number of Amharic news items in the test data.
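The four formulas can be checked with a few lines of Python; the confusion counts below are invented, not the paper's results:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts.
    TR, the size of the test set, equals tp + fp + fn + tn."""
    precision = tp / (tp + fp)                          # equation (1)
    recall = tp / (tp + fn)                             # equation (2)
    f1 = 2 * precision * recall / (precision + recall)  # equation (3)
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # equation (4)
    return precision, recall, f1, accuracy

p, r, f1, acc = metrics(tp=90, fp=10, fn=10, tn=90)
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))  # → 0.9 0.9 0.9 0.9
```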


Classifier             Vectorizer        Accuracy   F1     Precision   Recall
Naïve Bayes            TF-IDF            93.18      1.00   0.79        0.88
                       CountVectorizer   96.38      0.99   0.90        0.94
Passive Aggressive     TF-IDF            96.59      0.99   0.91        0.95
                       CountVectorizer   -          -      -           -
Support Vector         TF-IDF            94.88      0.98   0.86        0.92
Machine                CountVectorizer   94.03      0.96   0.85        0.90
Logistic Regression    TF-IDF            91.68      1.00   0.75        0.85
                       CountVectorizer   93.39      0.97   0.82        0.89
Random Forest          TF-IDF            92.54      0.87   1.00        0.77
                       CountVectorizer   92.96      0.88   1.00        0.78
Stochastic Gradient    TF-IDF            94.88      0.92   0.98        0.86
Descent                CountVectorizer   95.10      0.92   0.96        0.89

Table 1: Fake News Detection Model Evaluation Result


All candidate models recorded above 90% accuracy with both TF-IDF and CountVectorizer. The highest classification results are achieved by the Passive Aggressive classifier (96.59% accuracy, 0.99 F1-score) and the Naïve Bayes classifier (96.38% accuracy, 0.99 F1-score). According to the evaluation results, both vectorizers contribute similarly to the highest performance.
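A minimal version of one such model, pairing CountVectorizer with a Naïve Bayes classifier, can be sketched with scikit-learn. The four-sentence corpus and its labels (1 = fake, 0 = real) are invented for illustration, not taken from the paper's dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; label 1 = fake, 0 = real.
texts = ["ሰበር ዜና የግድያ ሙከራ", "ሰበር ዜና አስደንጋጭ",
         "ጠቅላይ ሚኒስትር ዐቢይ ንግግር አደረጉ", "መንግስት አዲስ ፖሊሲ አወጣ"]
labels = [1, 1, 0, 0]

# Vectorize the raw text and fit the classifier in one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["ሰበር ዜና"]))  # → [1]
```

The same pipeline shape accommodates the other five classifiers (e.g. PassiveAggressiveClassifier) and TfidfVectorizer by swapping in the corresponding components.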
(Bar chart comparing Accuracy, F1-Score, Precision, and Recall for the NB, PAC, SVM, LR, RF, and SGD classifiers with TF-IDF and CountVectorizer features.)

Figure 5: Performance evaluation for different feature extraction

V. CONCLUSION
In this research, we collected news articles from various Amharic news sources and Facebook pages and subjected the collected news to labeling. Annotating the news articles is a laborious task that involves professional journalists labeling the news as fake or real. By strictly applying the specified annotation guideline, around 7,547 news items were examined; of these, 961 were found to be real news and 457 fake news. After annotation, the next task was to pre-process the articles with tokenization, punctuation removal, and stop word removal. Another outcome of this research is the building and deployment of a machine-learning-based Amharic fake news classifier; the evaluation results of the natural language processing and machine learning models suggest that Naïve Bayes and the Passive Aggressive classifier with CountVectorizer and TF-IDF can perform the classification task with high accuracy and F1-score. The dataset crafted in this research is

limited in size for conducting extensive analysis; for future work, crowd-sourcing fake news can build a much larger fake news corpus.
REFERENCES
[1] Viera Maslej Krešňáková, Martin Sarnovský, Deep Learning Methods for Fake News Detection, IEEE Joint 19th International Symposium on Computational Intelligence and Informatics and 7th IEEE International Conference on Recent Achievements in Mechatronics, Automation, Computer Sciences and Robotics, November 14-16, 2019.
[2] Mulat Getaneh Tiruneh, Amharic WordNet Construction Using Word Embedding, unpublished master's thesis, Addis Ababa University.
[3] Federico Monti, Fabrizio Frasca, Davide Eynard, Damon Mannion, Fake News Detection on Social Media Using Geometric Deep Learning, February 2019.
[4] Md Zobaer Hossain, Md Ashraful Rahman, Md Saiful Islam, Sudipta Kar, BanFakeNews: A Dataset for Detecting Fake News in Bangla, Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2862-2871, Marseille, 11-16 May 2020.
[5] Samir Bajaj, Fake News Detection Using Deep Learning, Stanford University, 2017.
[6] Maaz Amjad, Grigori Sidorov, Alisa Zhila, Data Augmentation Using Machine Translation for Fake News Detection in the Urdu Language, Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2537-2542, Marseille, 11-16 May 2020.
[7] Kai Nakamura, Sharon Levy, William Yang Wang, r/Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection, University of California, 2020.
[8] Xinyi Zhou, Reza Zafarani, A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities, Syracuse University, USA, July 2020.
[9] Paweł Ksieniewicz, Michał Choraś, Paweł Zyblewski, Rafał Kozik, Michał Woźniak, Agata Giełczyk, Fake News Detection from Data Streams, IEEE, 2020.
[10] Julio C. S. Reis, André Correia, Fabrício Murai, Adriano Veloso, Fabrício Benevenuto, Supervised Learning for Fake News Detection, IEEE, 2019.
[11] Julio C. S. Reis, André Correia, Fabrício Murai, Adriano Veloso, Fabrício Benevenuto, Supervised Learning for Fake News Detection, IEEE, 2019.
[12] Vasu Agarwal, H. Parveen Sultana, Srijan Malhotra, Amitrajit Sarkar, Analysis of Classifiers for Fake News Detection, International Conference on Recent Trends in Advanced Computing (ICRTAC), 2019.
[13] Cody Buntain, Jennifer Golbeck, Automatically Identifying Fake News in Popular Twitter Threads, IEEE International Conference on Smart Cloud, 2017.
[14] Rohit Kumar Kaliyar, Anurag Goswami, Pratik Narang, FakeBERT: Fake News Detection in Social Media with a BERT-based Deep Learning Approach, Springer Science+Business Media, LLC, part of Springer Nature, 2021.
[15] Manav Gupta, Blockchain For Dummies, 3rd IBM Limited Edition, John Wiley & Sons, Inc., 2020.
[16] Truepic, https://truepic.com/
[17] News Provenance Project, https://www.newsprovenanceproject.com/
[18] Voice, https://www.voice.com/
[19] Gereme, F., Zhu, W., Ayall, T., Alemu, D., Combating Fake News in "Low-Resource" Languages: Amharic Fake News Detection Accompanied by Resource Crafting, Information 2021, 12, 20. https://doi.org/10.3390/info12010020
[20] M. Avinash and E. Sivasankar, A Study of Feature Extraction Techniques for Sentiment Analysis, Springer Nature Singapore Pte Ltd., 2019.
[21] Viera Maslej Krešňáková, Martin Sarnovský, Deep Learning Methods for Fake News Detection, IEEE Joint 19th International Symposium on Computational Intelligence and Informatics and 7th IEEE International Conference on Recent Achievements in Mechatronics, Automation, Computer Sciences and Robotics, November 14-16, 2019.
