BanFakeNews: A Dataset for Detecting Fake News in Bangla

arXiv:2004.08789v1 [cs.CL] 19 Apr 2020

Abstract
Observing the damage that can be done by the rapid propagation of fake news in sectors such as politics and finance, automatic identification of fake news using linguistic analysis has drawn the attention of the research community. However, such methods are largely developed for English, and low-resource languages remain out of focus. But the risks posed by fake and manipulative news are not confined to any one language. In this work, we propose an annotated dataset of ≈ 50K news items that can be used for building automated fake news detection systems for a low-resource language like Bangla. Additionally, we provide an analysis of the dataset and develop a benchmark system with state-of-the-art NLP techniques to identify Bangla fake news. To build this system, we explore traditional linguistic features and neural network based methods. We expect this dataset to be a valuable resource for building technologies that prevent the spread of fake news and to contribute to research on low-resource languages. The dataset and source code are publicly available at https://github.com/Rowan1697/FakeNews.
Keywords: Fake news, Bangla, Low Resource Language
• Source: When a dubious news item does not contain any reference (an entity who provided the information), humans take it as false news.

• Satire: Humans can detect almost all types of satirical news, but sometimes some true news also sounds like satire, and humans take it as fake news.

This analysis indicates that the source plays an influential role in fake news detection. So, in our dataset, we manually annotated the source of the news so that, in the future, the source can be used as an important feature for detecting fake news.

5. Methodologies
In this section, we describe the systems we develop to classify fake news written in Bangla. Our approaches include traditional linguistic features and multiple neural network based models.

5.1. Traditional Linguistic Features
• Lexical Features: Due to their strong performance in various text classification tasks, we extract word n-grams (n=1,2,3) and character n-grams (n=3,4,5) from the news articles.

We have found that the publishing sites of fake news are less popular than the sites of true news, so we used the Alexa Ranking10 of the sites, which is designed to estimate the popularity of websites, as a feature. We did not find the rank of some of the news sites, so we annotated these with the maximum rank among the other sites, and we used the normalized value of the ranks as a feature in the experiments.

7 https://fasttext.cc/docs/en/crawl-vectors.html
8 http://wikipedia.org
9 https://commoncrawl.org/
10 https://www.alexa.com

5.2. Neural Network Models
Neural networks have demonstrated impressive performance in a wide range of text classification and generation tasks. Given a large amount of training data, such models typically achieve higher accuracy than linguistic feature based methods. Hence, we experiment with several neural network models that have been used as benchmark models in different text classification tasks.
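Concretely, the word and character n-gram features of Section 5.1 can be combined and fed to a linear classifier along the following lines. This is a minimal scikit-learn sketch, not the authors' implementation; the toy corpus and hyper-parameters are assumptions for illustration only.

```python
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Word n-grams (n=1,2,3) and character n-grams (n=3,4,5), as in Section 5.1.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
])

# A linear SVM over the combined sparse n-gram features.
model = make_pipeline(features, LinearSVC(C=1.0))

# Tiny placeholder corpus (assumption); the real inputs are Bangla news articles.
texts = ["shocking miracle cure revealed", "parliament passes annual budget",
         "aliens endorse miracle diet", "central bank updates interest rate"]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = authentic

model.fit(texts, labels)
preds = model.predict(["miracle cure endorsed", "bank passes budget"])
```

Extra columns such as the normalized Alexa rank would be appended to the sparse n-gram matrix before the SVM step.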
Convolutional Neural Network (CNN): Convolutional networks have shown effectiveness in classifying short and long texts in a variety of problems (Kim, 2014; Shrestha et al., 2017). We therefore experiment with classifying fake news using a CNN model similar to (Kim, 2014). We use kernels of lengths 1 to 4 and empirically use 256 kernels for each kernel length. As a pooling layer, we experiment with global max pooling and average pooling. We use ReLU (Agarap, 2018) as the activation function inside the network.

Long Short Term Memory: Due to their capability of capturing sequential information efficiently, Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks are among the most widely used models in text classification and generation problems. Specifically, bidirectional LSTMs (Bi-LSTM) have shown impressive performance by capturing sequential information from both directions of a text. Moreover, the attention mechanism has proven to be a strong pooling technique for classification tasks when used with a Bi-LSTM. In this work, we experiment with a Bi-LSTM model with attention on top, similar to (Zhou et al., 2016). We use 256 LSTM units and two layers of Bi-LSTM in the network.

5.3. Pre-trained Language Model
Recently, pre-trained language models like OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and ULMFiT (Howard and Ruder, 2018) have made a breakthrough in a wide range of NLP tasks. Specifically, BERT and its variant models have shown superior performance on the GLUE benchmark for Natural Language Understanding (NLU) (Wang et al., 2019). To evaluate the scope of such a language model in our work, we use the multilingual BERT model to classify news documents. We use the pre-trained model weights and implementation publicly distributed by HuggingFace's Transformers (Wolf et al., 2019).

6. Experimental Setup
Data Pre-processing: We apply several pre-processing steps to the data, such as text normalization and the removal of stop words and punctuation. We use the Bangla stop words from Stopwords ISO11. We observe better validation performance with such pre-processing of the data.

11 https://github.com/stopwords-iso

Evaluation Metric: We use Micro-F1 scores to evaluate the different methods. As the dataset is imbalanced, we also report the precision (P), recall (R), and F1 score for the minority class (fake).

Baselines: We compare our experimental results with a majority baseline and a random baseline. The majority baseline assigns the most frequent class label (authentic news) to every article, whereas the random baseline randomly tags an article as authentic or fake. We report the mean precision, recall, and F1-score of 10 random baseline runs in Table 3. The standard deviation (SD) of precision, recall, and F1-score for both the overall and the fake class is less than 10^-2, except for the recall of the fake class, which is 0.027.

Experiments: With the linguistic features, we experiment with training a Linear Support Vector Machine (SVM) (Hearst, 1998), a Random Forest (RF) (Liaw and Wiener, 2002), and a Logistic Regression (LR) (McCullagh and Nelder, 1989) model. We split our dataset into training and testing sets at a 70:30 ratio. We tune the penalty parameter (C) based on the validation results.

For the BiLSTM, CNN, and BERT based experiments, the hyper-parameters are optimizer: Adam (Kingma and Ba, 2015); learning rate: 0.00002; batch size: 32. The hidden size for the BERT model is 756, while in the CNN and BiLSTM it is 256. For the CNN, we use kernel lengths of 1 to 4 and zero right-padding in the experiment. The train and test datasets are kept at a 70:30 ratio, and in training we use 10% of the test data as validation data. We run 50 epochs for each experiment and checkpoint on the F1 score of the fake class on the validation data. Finally, we report the result of the best-scoring model from training on our test dataset.

For BERT, we fine-tune on our dataset using the sequence classification model from HuggingFace's Transformers, and we use BERT's pre-trained multilingual cased model, which is trained on 104 different languages12.

12 https://github.com/google-research/bert

7. Results and Analysis
We report our results in Table 3. The overall performance of every experiment is almost the same: in most cases we achieve almost perfect precision, recall, and F1. But the precision, recall, and F1-score of the fake class vary from experiment to experiment. In our dataset, the number of authentic news items is 37.47 times higher than the number of fake news items, which could be the reason behind such variance between the overall and fake-class results. To evaluate the performance of the different models, we use the precision, recall, and F1-score of the fake class in the rest of this section.

Fig. 3 shows the experiments with linguistic features using SVM, RF, and LR. Here SVM outperforms LR and RF by quite a margin, except on the news embedding. In the case of the news embedding, RF scores 55% F1, while SVM and LR score 46% and 53% F1, respectively. For most of the features, RF performs better than the LR model. We report the results of SVM in Table 3 since it performed better than the others. Table 3 shows that the lexical features perform better than the other linguistic features, the majority and random baselines, and the neural network models as well. It is also observed that the fake-class F1-score of the SVM model decreases as the number of grams increases for both word and character n-grams.

The POS tag results do not improve over the random baseline, and the F1-score of this feature indicates that it cannot separate fake news from authentic news. The F1-scores of MP, Word Embedding (Fasttext), and Word Embedding (News) are better than the POS tags but fall behind the lexical features and the neural network models; however, these features outperform the majority and random baselines. We get our best result when incorporating all linguistic features with the SVM, which scores 91% F1.
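The random-baseline numbers for the fake class in Table 3 follow directly from this imbalance. A quick sketch of the expected scores of a coin-flip tagger, using the 37.47:1 ratio reported above:

```python
# Expected fake-class scores of the random baseline under the
# ~37.47:1 authentic-to-fake imbalance reported in Section 7.
ratio = 37.47                     # authentic news items per fake news item
p_fake = 1 / (1 + ratio)          # prior probability of the fake class

precision = p_fake                # predicted-fake pool mirrors the class prior
recall = 0.5                      # a coin flip tags half of all fake items as fake
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # prints: 0.03 0.5 0.05
```

These match the Random row of Table 3 (P 0.03, R 0.50, F1 0.05) for the fake class.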
Figure 3: Comparison between SVM, LR and RF on linguistic features
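Why every row of Table 3 shows near-perfect overall scores while the fake-class columns swing widely can be seen with a small worked example. The confusion counts below are hypothetical, chosen only to mirror the roughly 37:1 imbalance:

```python
# Hypothetical confusion counts on a 38,470-item test set (~37:1 imbalance).
tp = 710     # fake items correctly flagged
fn = 290     # fake items missed
fp = 10      # authentic items wrongly flagged as fake
tn = 37460   # authentic items correctly passed

# In single-label classification, micro-averaged F1 equals accuracy,
# so the huge authentic class dominates it.
micro_f1 = (tp + tn) / (tp + tn + fp + fn)

# Fake-class scores depend only on the minority-class cells.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(micro_f1, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Here the micro-F1 rounds to 0.99 even though the fake-class recall is only 0.71, which is exactly the pattern of the Unigram row in Table 3.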
                                  Overall             Fake Class
                               P     R     F1       P     R     F1
Baselines
Majority                      0.97  1.00  0.99     0.00  0.00  0.00
Random                        0.97  0.50  0.66     0.03  0.50  0.05
Traditional Linguistic Features
Unigram (U)                   0.99  0.99  0.99     0.99  0.71  0.83
Bigram (B)                    0.98  0.99  0.99     0.97  0.42  0.59
Trigram (T)                   0.98  0.99  0.98     0.74  0.31  0.44
U+B+T                         0.99  0.99  0.99     0.98  0.68  0.80
C3-gram (C3)                  0.99  0.99  0.99     0.98  0.82  0.89
C4-gram (C4)                  0.99  0.99  0.99     0.99  0.78  0.87
C5-gram (C5)                  0.99  0.99  0.99     1.00  0.74  0.85
C3+C4+C5                      0.99  0.99  0.99     1.00  0.77  0.87
All Lexical (L)               0.99  0.99  0.99     1.00  0.76  0.86
POS tag                       0.97  1.00  0.98     1.00  0.00  0.01
L+POS                         0.99  0.99  0.99     0.99  0.76  0.86
Embedding (F)                 0.98  0.99  0.99     0.94  0.33  0.49
Embedding (N)                 0.98  0.99  0.99     0.84  0.32  0.46
L+POS+E(F)                    0.99  0.99  0.99     0.99  0.77  0.87
L+POS+E(N)                    0.99  0.99  0.99     0.98  0.79  0.88
MP                            0.97  0.99  0.98     0.94  0.15  0.27
L+POS+E(F)+MP                 0.99  0.99  0.99     0.99  0.84  0.91
L+POS+E(N)+MP                 0.99  0.99  0.99     0.98  0.84  0.91
All Features                  0.99  0.99  0.99     0.98  0.84  0.91
Neural Network Models
CNN                           0.98  1.00  0.99     0.79  0.41  0.59
LSTM                          0.99  0.99  0.99     0.69  0.44  0.53
BERT                          0.99  1.00  0.99     0.80  0.60  0.68

Table 3: Results of experiments with Traditional Linguistic Features (SVM) and Neural Networks on the test set. P and R denote precision and recall, respectively. 'F' and 'N' abbreviate 'Fasttext' and 'News', respectively.

Neural network models have shown better results in different text classification problems (Lai et al., 2015; Joulin et al., 2016). But in our experiments, we find that the fake-class F1-score of the neural networks cannot outperform the linear classifiers. For the CNN, the experiments with average pooling and the global max technique score 59% and 54% F1, respectively. Generally, an attention model with an LSTM performs better than a CNN (Yang et al., 2016; Chen et al., 2016); in our case, the F1-score of the BiLSTM with attention cannot improve over the CNN model. The best result among the neural network based experiments comes from BERT, whose F1-score is 68%. Though the neural network models perform better than the majority and random baselines, they still fall behind the performance of the SVM model.

To compare the performance of our fake news detection system with humans, we tested the same dataset that we used for the human baseline with our two best models. The model with SVM and character 3-grams scores 89% F1, while the SVM with all linguistic features scores 90%. Here, the performance of our best models is almost the same as on the test set. In our human baseline dataset, there are 90 authentic news items along with 60 fake ones. The model with all linguistic features correctly identified 50 of the fake news items, while the character 3-gram model picked out 49. Compared to the human performance of 64.8% F1, these two models have performed much better. These results suggest that linear classifiers can separate the slight margins between fake and authentic news that humans usually overlook.

8. Conclusion
In this paper, we present the first labeled dataset of Bangla fake news. The evaluation of linear classifiers and neural network based models suggests that linear classifiers with traditional linguistic features can perform better than the neural network based models. Another key finding is that character-level features are more significant than word-level features. In addition, we find that punctuation is used more frequently in fake news than in authentic news, and that most of the time fake news is found on the least popular sites.
However, since character-level features have shown better results, we will incorporate character-level features into neural network models such as (Kim et al., 2016; Hwang and Sung, 2017). From human observation, we found that the source can also play a key role in fake news detection; we will include this feature in our future experiments as well. Besides, we will continue to expand our dataset: we have manually annotated around 8.5K news items and will continue this process to reach the 50K mark. We hope our dataset will provide opportunities for other researchers to apply computational approaches to fake news detection.

9. Acknowledgements
We would like to thank the data annotators and the resource team who helped us during the human baseline experiment. We also thank the Natural Language Processing Group, Dept. of CSE, SUST for their valuable comments and discussions.

10. Bibliographical References
Agarap, A. F. (2018). Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375.
Ahmad, A. and Amin, M. R. (2016). Bengali word embeddings and it's application in solving document classification problem. In 2016 19th International Conference on Computer and Information Technology (ICCIT), pages 425–430. IEEE.
Ahmad, T., Akhtar, H., Chopra, A., and Akhtar, M. W. (2014). Satire detection from web documents using machine learning methods. In 2014 International Conference on Soft Computing and Machine Intelligence, pages 102–105. IEEE.
Biyani, P., Tsioutsiouliklis, K., and Blackmer, J. (2016). "8 amazing secrets for getting more clicks": Detecting clickbaits in news streams using article informality. In Thirtieth AAAI Conference on Artificial Intelligence.
Burfoot, C. and Baldwin, T. (2009). Automatic satire detection: Are you having a laugh? In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 161–164.
Chakraborty, A., Paranjape, B., Kakarla, S., and Ganguly, N. (2016). Stop clickbait: Detecting and preventing clickbaits in online news media. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 9–16. IEEE.
Chen, H., Sun, M., Tu, C., Lin, Y., and Liu, Z. (2016). Neural sentiment classification with user and product attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1650–1659, Austin, Texas, November. Association for Computational Linguistics.
Ciampaglia, G. L., Shiralkar, P., Rocha, L. M., Bollen, J., Menczer, F., and Flammini, A. (2015). Computational fact checking from knowledge networks. PloS one, 10(6):e0128193.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Dong, X., Victor, U., Chowdhury, S., and Qian, L. (2019). Deep two-path semi-supervised learning for fake news detection. CoRR, abs/1906.05659.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.
Gairola, S., Lal, Y. K., Kumar, V., and Khattar, D. (2017). A neural clickbait detection engine. arXiv preprint arXiv:1710.01507.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T. (2018). Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
Hearst, M. A. (1998). Support vector machines. IEEE Intelligent Systems, 13(4):18–28, July.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Howard, J. and Ruder, S. (2018). Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
Hwang, K. and Sung, W. (2017). Character-level language modeling with hierarchical recurrent neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5720–5724. IEEE.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Karadzhov, G., Nakov, P., Màrquez, L., Barrón-Cedeño, A., and Koychev, I. (2017). Fully automated fact checking using external sources. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 344–353, Varna, Bulgaria, September. INCOMA Ltd.
Kim, Y., Jernite, Y., Sontag, D., and Rush, A. M. (2016). Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.
Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In Yoshua Bengio et al., editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Lai, S., Xu, L., Liu, K., and Zhao, J. (2015). Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3):18–22.
Long, Y., Lu, Q., Xiang, R., Li, M., and Huang, C.-R. (2017). Fake news detection through multi-perspective speaker profiles. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 252–256, Taipei, Taiwan, November. Asian Federation of Natural Language Processing.
Loper, E. and Bird, S. (2002). NLTK: The natural language toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Philadelphia: Association for Computational Linguistics.
Manik, J. A. (2012). A hazy picture appears. https://www.thedailystar.net/news-detail-252212, October. [Online; posted 03-October-2012].
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall / CRC, London.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, USA. Curran Associates Inc.
Oraby, S., Reed, L., Compton, R., Riloff, E., Walker, M., and Whittaker, S. (2017). And that's a fact: Distinguishing factual and emotional argumentation in online dialogue. arXiv preprint arXiv:1709.05295.
Pennebaker, J. W., Boyd, R. L., Jordan, K., and Blackburn, K. (2015). The development and psychometric properties of LIWC2015. Technical report.
Pérez-Rosas, V., Kleinberg, B., Lefevre, A., and Mihalcea, R. (2018). Automatic detection of fake news. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3391–3401, Santa Fe, New Mexico, USA, August. Association for Computational Linguistics.
Pratiwi, I. Y. R., Asmara, R. A., and Rahutomo, F. (2017). Study of hoax news detection using naïve bayes classifier in indonesian language. In 2017 11th International Conference on Information & Communication Technology and System (ICTS), pages 73–78. IEEE.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf.
Ren, Y. and Zhang, Y. (2016). Deceptive opinion spam detection using neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 140–150, Osaka, Japan, December. The COLING 2016 Organizing Committee.
Rony, M. M. U., Hassan, N., and Yousuf, M. (2017). Diving deep into clickbaits: Who use them to what extents in which topics with what effects? In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017, pages 232–239. ACM.
Rubin, V., Conroy, N., Chen, Y., and Cornwell, S. (2016). Fake news or truth? Using satirical cues to detect potentially misleading news. In Proceedings of the Second Workshop on Computational Approaches to Deception Detection, pages 7–17, San Diego, California, June. Association for Computational Linguistics.
Scholkopf, B. and Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA.
Shrestha, P., Sierra, S., González, F., Montes, M., Rosso, P., and Solorio, T. (2017). Convolutional neural networks for authorship attribution of short texts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 669–674, Valencia, Spain, April. Association for Computational Linguistics.
Shu, K., Sliva, A., Wang, S., Tang, J., and Liu, H. (2017). Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1):22–36.
Tufekci, Z. (2018). It's the (democracy-poisoning) golden age of free speech. https://www.wired.com/story/free-speech-issue-tech-turmoil-new-censorship/?CNDID=50121752, January. [Online; posted 16-January-2018].
Vlachos, A. and Riedel, S. (2014). Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22, Baltimore, MD, USA, June. Association for Computational Linguistics.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2019). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
Wang, W. Y. (2017). "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
Yang, F., Mukherjee, A., and Dragut, E. (2017). Satirical news detection and analysis using attention mechanism and linguistic features. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1979–1989, Copenhagen, Denmark, September. Association for Computational Linguistics.
Zhou, X., Cao, J., Jin, Z., Xie, F., Su, Y., Chu, D., Cao, X., and Zhang, J. (2015). Real-time news certification system on Sina Weibo. In Proceedings of the 24th International Conference on World Wide Web, pages 983–988. ACM.
Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., and Xu, B. (2016). Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212, Berlin, Germany, August. Association for Computational Linguistics.

Appendix
A News Sources

Domain                  Rank      #News
bdjournal365.com        2395727   1
bdsangbad.com           5272008   1
bdtype.com              309252    1
bn.mtnews24.com         63562     1
daily-bangladesh.com    6961      1
dkpapers.com            216868    1
sangbadprotidin24.com   1489949   1
sonalinews.com          37689     1
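The rank-based feature described in Section 5.1 can be sketched over rows like those above. The values come from the appendix table; the hypothetical "unknown-site.com" entry and the divide-by-maximum normalization scheme are assumptions for illustration, since the paper only states that missing ranks are set to the maximum and that normalized ranks are used.

```python
# Alexa ranks for news domains (values from the appendix table).
# None marks a site whose rank was not found (hypothetical entry).
ranks = {
    "daily-bangladesh.com": 6961,
    "sonalinews.com": 37689,
    "bn.mtnews24.com": 63562,
    "dkpapers.com": 216868,
    "bdtype.com": 309252,
    "sangbadprotidin24.com": 1489949,
    "bdjournal365.com": 2395727,
    "bdsangbad.com": 5272008,
    "unknown-site.com": None,
}

# Sites without a rank receive the maximum observed rank, as in Section 5.1.
max_rank = max(r for r in ranks.values() if r is not None)
filled = {d: (r if r is not None else max_rank) for d, r in ranks.items()}

# Normalize to [0, 1] so the feature is on a comparable scale (assumed scheme).
normalized = {d: r / max_rank for d, r in filled.items()}
```

A lower normalized value then corresponds to a more popular site.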