IJARSCT
International Journal of Advanced Research in Science, Communication and Technology (IJARSCT)
Abstract: The rapid growth of social media platforms has closed the gap in everyday information
exchange. At the same time, social media has become the main arena for disseminating manipulated
information at large scale and at an exponential rate. The fabrication of distorted information is
not limited to one language, society, or domain, as has been particularly evident during the ongoing
COVID-19 pandemic. The creation and propagation of fabricated news creates an urgent demand for
automatically classifying and detecting such distorted news articles. Manually detecting fake news is
laborious and tiresome, and the dearth of annotated fake news datasets for automating fake news
detection remains a major challenge for the low-resourced Amharic language (the second most widely
spoken Semitic language, after Arabic). In this study, an Amharic fake news dataset is crafted
from verified news sources and various social media pages, and six machine learning classifiers are
built: Naïve Bayes, SVM, Logistic Regression, SGD, Random Forest, and Passive Aggressive Classifier.
The experimental results show that Naïve Bayes and the Passive Aggressive Classifier surpass the
remaining models, with accuracy above 96% and an F1-score of 99%. The study makes a significant
contribution to reducing the rate of disinformation in vernacular languages.
Keywords: Amharic, Fake News, Machine Learning, Natural Language Processing
I. INTRODUCTION
Fake news refers to news stories that are false or nonfactual: the story is manipulated, with no authenticated facts,
sources, or quotes. Sometimes these stories are propaganda intended to deceive the reader, or they are designed as
clickbait. Clickbait is written mainly for economic gain, with headlines crafted to attract more clicks from
readers. The anti-social behavior of social media users can be broadly categorized into two
main groups [1]. The first group is the spreading of misinformation, which can take many forms, such as hoaxes and the
dissemination of fake news on the internet. The second group is characterized by reactions to specific users, which could
include discussion manipulation, cyberbullying, or other forms of similar behavior. Both types of anti-social behavior
pose a serious problem because their consequences can be severe in the real world.
Amharic belongs to the Semitic language family. It is the world's second most spoken Semitic language, after Arabic. The
federal government of Ethiopia uses Amharic as its working language. In addition, it is the working
language (lingua franca) of several regional states and cities within the federation [2]. The wide acceptance of
social media and the rapid increase in social media users in the country have made it a venue for spreading
fabricated information. The dissemination of unverified or false information on social platforms, especially Facebook,
sometimes sparks ethnic and religious clashes. The common way of checking whether
news is fake or real is the traditional, manual approach of verifying it against different legitimate sources; the time
and effort required for such manual verification is high.
A well-known Amharic fact-checking channel called Ethiopia Check polled 54,400 participants on its
Telegram channel about social media users' experience in dealing with misinformation. The result revealed that 58%
of the readers decide to check the truthfulness of the news further against other verified news sources, while another
38% of the respondents only consume the news without any further verification. Another poll conducted by
the same channel asked 2,209 respondents about the main contributors to fabricated misinformation. Most
of the respondents believe that so-called activists take the lion’s share by producing unverified news, followed in order by
bloggers, broadcast media and journalists, government organizations, and opposition bodies.
In this research work, we propose to build an Amharic fake news dataset from various news sources. The major
contributions of this research are:
1. A fake news dataset is constructed in the Amharic language.
2. Different machine learning models are proposed to detect Amharic fake news.
The rest of this paper is organized as follows. Section Two discusses existing approaches to building fake news datasets
and classifiers in different languages. Section Three presents the details of the framework and the proposed system
architecture. Section Four presents the experimental results and evaluation metrics of the study. Possible future
extensions, contributions, and the conclusion of the work are presented in Section Five.
Using two credibility-focused Twitter datasets, PHEME (a dataset of potential rumors on
Twitter and journalistic assessments of their accuracy) and CREDBANK (a crowd-sourced dataset of accuracy
assessments for events on Twitter), [13] develops a method for automatically detecting fake news on Twitter. The paper shows
how an automated system can detect fake news in popular Twitter threads. Moreover, the study concludes that
employing non-professional, crowd-sourced volunteers rather than expert journalists provides a useful and
inexpensive way to rapidly classify real and fake stories on Twitter.
Many useful methods for fake news detection have been developed in recent studies, including sequential
neural networks that encode news content and social-context-level information, where the text sequence is analyzed in
a unidirectional manner. To address this limitation, [14] uses a bidirectional training approach for detecting fake news,
based on the BERT (Bidirectional Encoder Representations from Transformers) deep learning model.
BERT is used as a sentence encoder, and it can accurately obtain a sentence's context representation.
Recently, blockchain has joined the battle against hoax news. Apart from voting, resume verification, supply
chains, and many other areas, blockchain solutions are being applied to fight misinformation [15]. Some blockchain-enabled
fake news detection efforts are Truepic [16], the News Provenance Project [17], which is led by researchers at
The New York Times, and Voice [18].
The target language of this study is Amharic. Amharic is a morphologically rich language and has its
own script, called Fidel. It is related to Arabic and Hebrew, and it is the second most widely spoken Semitic language
after Arabic, with 22 million native speakers. The authors of [19] conducted research on Amharic fake news detection using deep
learning and news content, and developed several computational linguistic resources for this under-resourced
language. However, the unavailability of an online labelled fake news dataset for this language makes it
harder to conduct research and provide practical solutions. The dire need to fight misinformation in vernacular
languages drives the authors of this research to build a fake news dataset and classification model from scratch.
III. METHOD
The availability and quality of annotated datasets is the most significant challenge for automated Amharic fake news
detection. The scarcity of manually labeled fake news datasets is undoubtedly a barrier to the advancement of
computationally intensive, text-based models that cover a wide range of topics. The dataset for the fake news challenge
does not fit our needs because it contains the ground truth about the relationships between texts but not whether those
texts are true or false statements.
3.3 Pre-processing
Like other languages, Amharic has its own punctuation marks, which separate texts or
sentences into a stream of words. Amharic punctuation marks include ‘hulet netb’ or colon (፡), ‘arat netb’ (።), ‘netela
serez’ (፣), and ‘derib serez’ (፤); together with the question mark ‘?’ and the exclamation mark ‘!’ or ‘¡’, they are used as
sentence delimiters or as white space [2]. In this work, we treat all punctuation marks as word delimiters, but ‘arat netb’,
‘hulet netb’, and white space are the ones most widely applied to tokenize words efficiently.
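The delimiter-based tokenization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `tokenize` and the optional `stopwords` parameter are hypothetical, and the delimiter set contains only the marks listed in the text.

```python
import re

# Delimiters named in the text: hulet netb, arat netb, netela serez,
# derib serez, plus '?', '!' and '¡'. White space is covered by \s.
AMHARIC_DELIMITERS = "፡።፣፤?!¡"
_SPLIT_PATTERN = re.compile("[" + re.escape(AMHARIC_DELIMITERS) + r"\s]+")


def tokenize(text, stopwords=frozenset()):
    """Split text on Amharic punctuation and white space.

    `stopwords` is a hypothetical, caller-supplied set of words to drop;
    no stop word list is given in the source.
    """
    tokens = _SPLIT_PATTERN.split(text)
    # re.split leaves empty strings at the edges; filter them out.
    return [tok for tok in tokens if tok and tok not in stopwords]


if __name__ == "__main__":
    print(tokenize("ሰበር ዜና፡ ጠቅላይ ሚኒስትር ዐቢይ።"))
```

Treating every punctuation mark and white space as one character class keeps the splitter to a single regular expression; a real pipeline would likely add normalization of Fidel character variants, which is outside this sketch.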
According to the results, the most frequent bigram is “ሰበር ዜና”, meaning “Breaking News”, and the most frequent
trigram is “ጠቅላይ ሚኒስትር ዐቢይ”, meaning “Prime Minister Abiy”.
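Bigram and trigram frequencies of this kind can be obtained by sliding a window of n tokens over each tokenized article and counting the resulting tuples. The sketch below assumes the articles are already tokenized; the function name `ngram_counts` and the two-article toy corpus are illustrative only and do not reproduce the study's data.

```python
from collections import Counter


def ngram_counts(token_lists, n):
    """Count n-grams (as tuples) across a list of tokenized documents."""
    counts = Counter()
    for tokens in token_lists:
        # Slide a window of length n; n-grams never span two documents.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    # Toy corpus built from the phrases quoted in the text.
    corpus = [
        ["ሰበር", "ዜና", "ጠቅላይ", "ሚኒስትር", "ዐቢይ"],
        ["ሰበር", "ዜና"],
    ]
    print(ngram_counts(corpus, 2).most_common(1))
    print(ngram_counts(corpus, 3).most_common(1))
```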
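\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]
\[ \text{F1-score} = \frac{2 \ast \text{Precision} \ast \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]
\[ \text{Accuracy} = \frac{TP + TN}{TR} \tag{4} \]
In the above equations, TP represents the number of fake news items correctly classified as fake, FP represents the number of
real news items incorrectly classified as fake, FN represents the number of fake news items incorrectly classified
as real (the negative category), TN represents the number of real news items correctly classified as real, and TR
represents the total number of Amharic news items in the test data.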
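As a worked illustration, the four evaluation metrics can be computed directly from confusion-matrix counts. The sketch assumes the standard definitions of precision, recall, F1-score, and accuracy, with fake news as the positive class; the true-negative count `tn` (real news correctly classified as real) is needed for accuracy, and the counts in the usage example are made up for illustration, not results from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1-score and accuracy from raw counts.

    tp: fake items classified as fake    fp: real items classified as fake
    fn: fake items classified as real    tn: real items classified as real
    """
    total = tp + fp + fn + tn  # total number of test items
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical counts for a 100-item test set.
    print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
```

The zero-denominator guards return 0.0 rather than raising, which is one common convention when a classifier predicts no positives.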
[Results table: classifier scores by feature extraction method (CountVectorizer, TF-IDF) and metric (Precision, Recall); values not recoverable.]
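Naïve Bayes over word-count features, one of the model and feature combinations named in this study, can be sketched in plain Python. This is a minimal multinomial Naïve Bayes with add-one (Laplace) smoothing, not the authors' implementation: the function names, the smoothing value `alpha`, and the English toy documents (placeholders standing in for Amharic articles) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on whitespace-tokenized documents."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.split()
        word_counts[label].update(tokens)  # bag-of-words counts per class
        vocab.update(tokens)
    n_docs = len(docs)
    log_prior = {lab: math.log(n / n_docs)
                 for lab, n in Counter(labels).items()}
    totals = {lab: sum(word_counts[lab].values()) for lab in log_prior}
    return {"alpha": alpha, "vocab": vocab, "log_prior": log_prior,
            "word_counts": word_counts, "totals": totals}


def predict(model, text):
    """Return the label with the highest posterior log-probability."""
    alpha, vocab_size = model["alpha"], len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, score in model["log_prior"].items():
        for tok in text.split():
            if tok not in model["vocab"]:
                continue  # words never seen in training are ignored
            count = model["word_counts"][label][tok]
            # Smoothed likelihood P(word | label)
            score += math.log((count + alpha) /
                              (model["totals"][label] + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    docs = ["breaking shocking secret cure", "shocking secret revealed",
            "ministry announces budget", "parliament approves budget"]
    labels = ["fake", "fake", "real", "real"]
    model = train_naive_bayes(docs, labels)
    print(predict(model, "secret cure revealed"))
    print(predict(model, "parliament budget"))
```

Working in log space avoids numerical underflow on long articles; replacing the raw counts with TF-IDF weights would give the second feature scheme mentioned in the study.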
V. CONCLUSION
In this research, we collected news articles from various Amharic news sources and Facebook pages and
labeled the collected news. Annotating news articles is a laborious task, and it involves
professional journalists labeling the news as fake or real. By strictly following the specified news annotation
guideline, around 7547 news items were examined; of these, 961 were found to be real news and 457 fake
news. After annotation, the dataset was pre-processed through tokenization and punctuation and stop word
removal. Another outcome of this research is the building and deployment of a machine learning based
Amharic fake news classifier. The evaluation results of the natural language processing and
machine learning based models suggest that the Naïve Bayes and Passive Aggressive classifiers, with CountVectorizer and
TF-IDF features, perform the classification task best, with high accuracy and F1-score. The dataset crafted in this
research is limited in size for extensive analysis; in future work, crowd-sourcing fake news could build a much larger
fake news corpus.
Copyright to IJARSCT DOI: 10.48175/IJARSCT-1362
www.ijarsct.co.in
ISSN (Online) 2581-9429
REFERENCES
[1] Viera Maslej Krešňáková, Martin Sarnovský, Deep learning methods for Fake News detection, IEEE Joint
19th International Symposium on Computational Intelligence and Informatics and 7th IEEE International
Conference on Recent Achievements in Mechatronics, Automation, Computer Sciences and Robotics,
November 14-16, 2019
[2] Mulat Getaneh Tiruneh, Amharic WordNet construction Using Word Embedding, unpublished master’s thesis,
Addis Ababa University.
[3] Federico Monti, Fabrizio Frasca, Davide Eynard, Damon Mannion, Fake News Detection on Social Media
using Geometric Deep Learning, Feb. 2019
[4] Md Zobaer Hossain, Md Ashraful Rahman, Md Saiful Islam, Sudipta Kar, BanFakeNews: A Dataset for
Detecting Fake News in Bangla, Proceedings of the 12th Conference on Language Resources and Evaluation
(LREC 2020), pages 2862–2871, Marseille, 11–16 May 2020
[5] Samir Bajaj, Fake news detection using deep learning, Stanford University, 2017
[6] Maaz Amjad, Grigori Sidorov, Alisa Zhila, Data Augmentation using Machine Translation for Fake News
Detection in the Urdu Language, Proceedings of the 12th Conference on Language Resources and Evaluation
(LREC 2020), pages 2537–2542, Marseille, 11–16 May 2020
[7] Kai Nakamura, Sharon Levy, William Yang Wang, r/Fakeddit: A New Multimodal Benchmark Dataset for
Fine-grained Fake News Detection, University of California, 2020
[8] Xinyi Zhou, Reza Zafarani, A Survey of Fake News: Fundamental Theories, Detection Methods, and
Opportunities, Syracuse University, USA, July 2020
[9] Paweł Ksieniewicz, Michał Choraś, Paweł Zyblewski, Rafał Kozik, Michał Woźniak, Agata Giełczyk, Fake
News Detection from Data Streams, IEEE, 2020
[10] Julio C. S. Reis, André Correia, Fabrício Murai, Adriano Veloso, Fabrício Benevenuto, Supervised
Learning for Fake News Detection, 1541-1672, IEEE, 2019
[11] Julio C. S. Reis, André Correia, Fabrício Murai, Adriano Veloso, Fabrício Benevenuto, Supervised
Learning for Fake News Detection, 1541-1672, IEEE, 2019
[12] Vasu Agarwal, H. Parveen Sultana, Srijan Malhotra, Amitrajit Sarkar, Analysis of Classifiers for Fake News
Detection, International Conference on Recent Trends in Advanced Computing, ICRTAC, 2019
[13] Cody Buntain, Jennifer Golbeck, Automatically Identifying Fake News in Popular Twitter Threads, IEEE
International Conference on Smart Cloud, 2017
[14] Rohit Kumar Kaliyar, Anurag Goswami, Pratik Narang, FakeBERT: Fake news detection in social media with
a BERT-based deep learning approach, Springer Science Business Media, LLC, part of Springer Nature 2021
[15] Manav Gupta, Blockchain For Dummies, 3rd IBM Limited Edition, 2020 by John Wiley & Sons, Inc.
[16] Truepic, https://truepic.com/
[17] NewsProvenance Project, https://www.newsprovenanceproject.com/
[18] Voice, https://www.voice.com/
[19] Gereme, F.; Zhu, W.; Ayall, T.; Alemu, D. Combating Fake News in “Low-Resource” Languages: Amharic
Fake News Detection Accompanied by Resource Crafting. Information 2021, 12, 20.
https://doi.org/10.3390/info12010020
[20] M. Avinash and E. Sivasankar, A Study of Feature Extraction Techniques for Sentiment Analysis, Springer
Nature Singapore Pte Ltd. 2019
[21] Viera Maslej Krešňáková, Martin Sarnovský, Deep learning methods for Fake News detection, IEEE Joint
19th International Symposium on Computational Intelligence and Informatics and 7th IEEE International
Conference on Recent Achievements in Mechatronics, Automation, Computer Sciences and Robotics,
November 14-16, 2019