DOI: https://doi.org/10.32350.icr.22.01
History: Received: September 20, 2022, Revised: October 28, 2022, Accepted: December 5, 2022
A publication of
School of Systems and Technology
University of Management and Technology, Lahore, Pakistan
Evaluating the Performance of Heterogeneous and
Homogeneous Ensemble-based Models for Twitter Spam
Classification
A. O. Ameen1, A. M. Oyelakin2, I. K. Ajiboye3, I. S. Olatinwo4, K. Y. Obiwusi5, and T. S. Ogundele2
1 Department of Computer Science, University of Ilorin, Ilorin, Nigeria
2 Department of Computer Science, Al-Hikmah University, Ilorin, Nigeria
3 Computer Science Unit, Abdulraheem College of Advanced Studies, an Affiliate of Al-Hikmah University, Ilorin, Nigeria
4 Department of Computer Science, Federal Polytechnic, Offa, Nigeria
5 Department of Mathematics and Computer Science, Summit University, Offa, Nigeria
Abstract – Spam-based attacks are growing in various social networks. Social network spam is a type of unwanted content that appears on social networking sites, such as Facebook, Twitter, Instagram, and others. This study used two categories of ensemble algorithms to build Twitter spam classification models. These algorithms worked by combining the strengths of individual learning algorithms and then reporting their combined performance. In ensemble learning, models are formed from data based on the assumption that combining the output of multiple models is better than using a single classifier. Hence, this study used a labeled public dataset for machine learning-based Twitter spam detection. Several studies have investigated the classification of Twitter spam from the available datasets. However, there is a paucity of works that investigated how machine learning-based models, built with homogeneous and heterogeneous algorithms, behave in Twitter spam classification. The ANOVA-F test was used for selecting the most promising features in the dataset. Then, a homogeneous tree-based Random Forest (RF) ensemble and a heterogeneous ensemble vote classifier were employed for the classification of Twitter spam. Tree-based algorithms were used to build a homogeneous Twitter spam detection model, while a combination of Support Vector Machine (SVM) and Decision Tree (DT) algorithms was used for building the heterogeneous model (using a maximum voting classifier). The current study found that the performance of the Twitter spam detection models was promising. In all, the heterogeneous model recorded better performance with regard to accuracy, precision, recall, and F1-score than the model built with homogeneous base classifiers.
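As a rough illustration of the ANOVA-F feature ranking mentioned in the abstract, the following Python sketch computes a one-way ANOVA F-statistic for each feature and sorts the features by it. The helper names (`anova_f`, `rank_features`) and the toy values are assumptions made for illustration only; they are not taken from the study's code or data, and the study does not state how many features were retained.

```python
from statistics import fmean


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    # Group the feature values by class label (e.g. spam vs non-spam).
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    grand_mean = fmean(values)
    ss_between = 0.0  # spread of the class means around the grand mean
    ss_within = 0.0   # spread of the values around their own class mean
    for members in groups.values():
        group_mean = fmean(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in members)

    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(values) - len(groups))
    return ms_between / ms_within if ms_within > 0 else float("inf")


def rank_features(columns, labels):
    """Return feature names sorted from highest to lowest F-statistic."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy, made-up values for two attribute names (1 = spam, 0 = non-spam).
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "no_urls": [3, 4, 5, 0, 1, 2],    # class means differ clearly
    "no_digits": [1, 2, 3, 1, 2, 3],  # identical distribution in both classes
}
ranking = rank_features(columns, labels)
```

A feature whose class means are far apart relative to its within-class spread receives a high F-statistic, so in this toy input `no_urls` ranks above `no_digits`.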
* Corresponding Author: moyelakin80@gmail.com
Innovative Computing Review
Volume 2 Issue 2, Fall 2022
Table I
Dataset Features and their Description

S/N  Attribute Name     Description of Attributes
1    account_age        The age (days) of an account since its creation until the time of sending the most recent tweet
2    no_follower        The number of followers of this Twitter user
3    no_following       The number of followings/friends of this Twitter user
4    no_userfavourites  The number of favourites this Twitter user received
5    no_lists           The number of lists this Twitter user added
6    no_tweets          The number of tweets this Twitter user sent
7    no_retweets        The number of retweets
8    no_hashtag         The number of hashtags included in this tweet
9    no_usermention     The number of user mentions included in this tweet
10   no_urls            The number of URLs included in this tweet
11   no_char            The number of characters in this tweet
12   no_digits          The number of digits in this tweet
Table II
Sample Size and Featureset in the Datasets

S/N  Derived Name for the Dataset  No. of Instances in the Dataset  Input Features in the Dataset  Binary Class (Spam or Non-spam)
1    TweetContinous1 (Dataset1)    10,000                           12                             YES
2    TweetRandom1 (Dataset2)       10,000                           12                             YES
3    TweetContinous2 (Dataset3)    100,000                          12                             YES
4    TweetRandom2 (Dataset4)       100,000                          12                             YES
Table IV
Classification Results of the Heterogeneous Model Based on Maximum Voting

S/N  Learning Algorithm                Metric           Model Performance
CASE 1 (5k continuous-Dataset1)
1    Heterogeneous Ensemble Algorithm  Accuracy         0.9922
2    Heterogeneous Ensemble Algorithm  Precision Score  0.9918
3    Heterogeneous Ensemble Algorithm  Recall           0.9922
4    Heterogeneous Ensemble Algorithm  F1-score         0.9917
CASE 2 (95k continuous-Dataset2)
5    Heterogeneous Ensemble Algorithm  Accuracy         0.9831
6    Heterogeneous Ensemble Algorithm  Precision Score  0.9833
7    Heterogeneous Ensemble Algorithm  Recall           0.9831
8    Heterogeneous Ensemble Algorithm  F1-score         0.9817
CASE 3 (5k random-Dataset3)
9    Heterogeneous Ensemble Algorithm  Accuracy         0.9970
10   Heterogeneous Ensemble Algorithm  Precision Score  0.9971
11   Heterogeneous Ensemble Algorithm  Recall           0.9970
12   Heterogeneous Ensemble Algorithm  F1-score         0.9970
CASE 4 (95k random-Dataset4)
13   Heterogeneous Ensemble Algorithm  Accuracy         0.9987
14   Heterogeneous Ensemble Algorithm  Precision Score  0.9986
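
The maximum-voting step and the four reported metrics can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the function names (`majority_vote`, `binary_metrics`) are hypothetical, the label lists are made up, and three base-model outputs are used so that a majority always exists. The study combines SVM and DT and does not say how ties are resolved or which averaging was used for precision, recall, and F1-score; here the metrics are computed for the spam class as the positive class.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists by taking the most common label per sample."""
    combined = []
    for votes in zip(*model_predictions):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score with `positive` as the spam class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up ground truth and base-model outputs (1 = spam, 0 = non-spam).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
model_a = [1, 1, 0, 0, 0, 1, 0, 1]  # hypothetical base classifier A
model_b = [1, 0, 1, 0, 0, 0, 0, 1]  # hypothetical base classifier B
model_c = [1, 1, 1, 0, 1, 1, 0, 0]  # hypothetical base classifier C

y_pred = majority_vote(model_a, model_b, model_c)
metrics = binary_metrics(y_true, y_pred)
```

In this toy input each base model makes two mistakes, while the voted prediction makes only one, which is the behaviour that motivates combining classifiers.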
spam datasets. The datasets used in this study were binary in nature. Several experiments were carried out, which involved using the same type of base classifiers (Decision Trees) to build an RF-based model. Furthermore, SVM-based and DT-based classifiers were used to build a heterogeneous model. During the experiments, varying random split ratios were used to achieve model validation. Good results were recorded at the split ratio of 75:25 for training and testing sets, respectively. The performance of the models was judged based on the four selected metrics of accuracy, precision, recall, and F1-score. It was observed that the model built with heterogeneous machine learning-based algorithms outperformed the ones built with homogeneous (tree-based) algorithms.

Acknowledgement
The authors wish to acknowledge the constructive comments of the anonymous reviewers, which made it possible for them to improve the manuscript.

References
[1] S. Rao, A. K. Verma, and T. Bhatia, “A review on social spam detection: Challenges, open issues, and future directions,” Expert Sys. Appl., vol. 186, Art. no. 115742, Dec. 2021, doi: https://doi.org/10.1016/j.eswa.2021.115742
[2] A. Pektaş and T. Acarman, “Botnet detection based on network flow summary and deep learning,” Int. J. Netw. Manag., vol. 28, no. 6, pp. 1–15, July 2018, doi: https://doi.org/10.1002/nem.2039
[3] B. Markines, C. Cattuto, and F. Menczer, “Social spam detection,” Proc. 5th Int. Workshop Advers. Inform. Retriev. Web – AIRWeb, 2009.
[4] S. Penchikala, “Big data processing with apache spark-part 4,” Spark Mach. Lear., 2016.
[5] D. Opitz and R. Maclin, “Popular ensemble methods: An empirical study,” J. Artif. Intell. Res., vol. 11, pp. 169–198, 1999.
[6] R. Polikar, “Ensemble based systems in decision making,” IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21–45, 2006, doi: 10.1109/MCAS.2006.1688199. S2CID 18032543.
[7] L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence