Professional Documents
Culture Documents
Authors’ contributions
This work was carried out in collaboration between both authors. Both authors read and approved the
final manuscript.
Received: 09/01/2023
Review Article Accepted: 15/03/2023
Published: 17/03/2023
ABSTRACT
In practically every industry today, from business to education, emails/messages are used. Ham
and spam are the two subcategories of emails/messages. Email or message spam, often known as
junk email or unwelcome email, is a kind of message that can be used to hurt any user by sapping
their time and computing resources and stealing important data. Spam messages volume is rising
quickly day by day. Today's email and IoT service providers face huge and massive challenges with
spam identification and filtration. Spam filtering is one of the most important and well-known
methods among all the methods created for identifying and preventing spam. This has been
accomplished using a number of machine learning and deep learning techniques, including Naive
Bayes, decision trees, neural networks, and random forests. By categorizing them into useful
groups, this study surveys the machine learning methods used for spam filtering. Based on
accuracy, precision, recall, etc., a thorough comparison of different methods is also made.
SMS Spam
Machine Deep
Bert
Learning Learning Spam
Technique
Method Methods detection
by using
DL
UnSupervised Application
Supervide
Methods Methods
The fundamental problem with spam is that there most likely better to utilize the expression "non-
is no longer any privacy; when someone spam", all things considered [16,17].
responds to these SMS messages, privacy is
violated. Because it uses a single click to assault 2. SPAM FILTRATION TECHNIQUES
the privacy bridge [7,8]. The easiest tool for a
cyber-attack is a mobile phone. Research has Spam emails are becoming more and more
shown that more than 200 million mobile users prevalent in politics, education, chain messaging,
receive spam SMS in a single day, which is stock market recommendations, and marketing
insufficient [9-11]. [18]. For effective spam identification and
filtering, numerous businesses are currently
1.1 What Is Spam? developing various methods and algorithms. In
order to comprehend the filtering process, we
discuss a few filtering mechanisms in this part.
Unwanted and unpleasant text messages in the
form of spam are those that we repeatedly 2.1 The Common Spam Filtering
getviaa transmission channel. Spam messages
Technique
have an impact on a device's performance,
power, and storage system. In short, spam has A filtering system that employs a set of rules and
proven to be the most unpleasant aspect of uses those set of protocols as a classifier is
personal communication [12,13]. known as standard spam filtering. The first phase
is the implementation of content filters, which
1.2 What Is Ham? identify spam using artificial intelligence
methods. The second phase involves the
Ham refers to messages that we receive from implementation of the email header filter, which
end devices that are not spam and are on a good extracts the header data from the email. After
list of requested and wanted messages. About that, backlist filters are applied to the emails to
2001, Spam Bayes first used the term "ham," weed out spam emails by securing the emails
which is currently recognized to mean "e-mail originating from the backlist file. The next step is
and messages that are commonly appreciated the implementation of rule-based filters, which
and aren't deemed spam [14,15]. identify the sender based on the subject line and
user-defined characteristics. Finally, a technique
Its utilization is especially normal among anti- that enables the account holder to send
spam software developers, and not broadly messages is implemented to use allowance and
known somewhere else; as a general rule, it is task filters [19-25].
168
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
Challegend Rulebased
based filter Filter
169
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
2.2 Filtering of Spam on the Client Side linked to Artificial Intelligence. There are two
parts of machine learning methods
A client is a person who has access to an email
network or the Internet and can send or receive Supervised
emails. Several rules and procedures for Unsupervised
ensuring secure communications transmission
between persons and organizations are offered 3.1 Supervised Learning Algorithm
by spam detection at the client point. A client
needs install various working frameworks on his With the help of two datasets, various
or her system for data transmission. By characterization techniques for SMS spam
connecting with client mail agents and discovery are evaluated (which were gathered
composing, receiving, and handling the incoming from free sources). Datasets are organized using
emails, such systems filter the client's mailbox preprocessing techniques such as tokenization
[26-28]. and Tf-IDF, as well as correlation algorithms and
deep learning classifiers (the choice tree, SVM,
2.3 Commercial-Grade Spam Filtering the calculated relapse, ANN, the arbitrary
timberland, the AdaBoost, CNN, NB) [38-40].
The process of detecting email spam at the
enterprise level involves installing different Three computations, including SVM, NB, and the
filtering frameworks on the server, interacting highest entropy calculation, were used by [6] to
with the mail transfer agent, and categorizing the identify spam and ham communications. Due to
gathered emails as either spam or ham. This the fact that many of the words are often used,
system client employs the system regularly and spam is difficult to identify. Fundamentally, SMS
successfully on a network where emails are spamming is a form of email spamming. It was
filtered using an enterprise filtering technique. discovered that SVM provides more accurate
The rule of ranking the email is used by existing results than Gullible Bayes and greatest entropy
spam detection techniques. This principle by using the Spam SMS dataset, which has
specifies a ranking function and generates a about 5574 records and is prepared by using
score for each post. A certain score or rating is stop word evacuation and tokenization. They
assigned to the spam or ham message. Since achieved 97.4% precision using SVM
spammers employ various strategies, all jobs are [14,15,38,41,42].
routinely adjusted by adding a list-based
technique to automatically block the messages 3.2 Unsupervised Learning Algorithm
[29-31].
For spam identification, Weka and RapidMiner,
2.4 Spam Filtering Using Cases two different arranging tools, were used. They
use AI calculations for grouping and ordering,
and to verify the precision of these calculations,
The case-based or sample-based spam filtering
they used a dataset that can be downloaded
system is one of the well-known and traditional
from UC Irvine. Findings demonstrate that Weka
machine learning techniques for spam
SVM acquired greater precision of 99.3% in 1.54
detection. With the help of the collection method,
seconds for grouping and K-Means in 2.7
this type of filtering has multiple stages; the first
seconds is amazing for bunching. With
one involves gathering data (email). The key
RapidMiner SVM, results are provided in 21
change then continues with the client's graphical
seconds with 96.64% precision and in 37.0
user interface preparation processes, outlining
seconds with K-Means [51-56].
abstraction, and selection of email data
categorization, evaluating the entire process
3.3 Deep Learning Methods
using vector expression and categorizing the
data into two groups: spam and genuine email Here, spam and non-spam messages from
[32-36]. clients are separated using convolutional neural
organization. The class of spam SMS was
3. MACHINE LEARNING ALGORITHMS identified using Tiago's dataset for its evaluation.
To increase the exactness rate, steps for
Machine learning methods support data tokenization and stop word preparation are also
prediction and data classification which are explained [61-63].
170
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
171
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
172
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
173
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
174
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
A method for discretely identifying spam the majority of the proposed spam detection
messages on the cell phone survey layer in real approaches. The supervised model training
world datasets. They make use of two octet- process depends on a large and time-consuming
based components: octet bigrams and recursion labelled dataset. SVM and Naive Bayes,
dispersion of bytes. The directed classifier supervised learning algorithms, outperform other
displays higher exactness of 93% with a no models in spam identification. The report offers
bogus rate alert in a comparison between in-depth analyses of these algorithms as well as
evolutionary classifiers (the fluffy Ada help, the some suggestions for further research in spam
directed classifier, the hereditary classifier, and filtering and detection.
the lengthy classifier) and non-developmental
classifiers (the K closest neighbour, the Naive COMPETING INTERESTS
Bayes Algorithm, C4.5, Jrip, and SVM) [64,65].
Authors have declared that no competing
4. CONCLUSIONS
interests exist.
Over the past two decades, a sizable research
community has become interested in spam REFERENCES
identification and filtration. Many studies have
been conducted in this field because to its 1. Faris H, Al-Zoubi AM, Heidari AA, Aljarah I,
expensive and significant impact in a variety of Mafarja M, Hassonah MA, et al. An
situations, including customer behavior and intelligent system for spam detection and
bogus reviews. The survey covers different identification of the most relevant features
machine learning methods and models that based on evolutionary random weight
different researchers have suggested for spam networks. Inf Fusion. 2019;48:67-83.
detection and filtering. The study divided them DOI: 10.1016/j.inffus.2018.08.002
into categories including unsupervised learning, 2. Blanzieri E, Bryl A. A survey of learning-
supervised and so forth. The study contrasts based techniques of email spam filtering.
different methods and gives an overview of the Artif Intell Rev. 2008;29(1):63-92.
key takeaways for each group. This study comes DOI: 10.1007/s10462-009-9109-6
to the conclusion that supervised machine 3. Choudhary K, Garrity KF, Reid ACE,
learning techniques constitute the foundation of DeCost B, Biacchi AJ, Hight Walker AR, et
175
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
al. The joint automated repository for reinforcement learning. Decis Support
various integrated simulations (JARVIS) for Syst. 2018;107:88-102.
data-driven materials design. npj Comp DOI: 10.1016/j.dss.2018.01.001
Mater. 2020;6(1):173. 14. Narisawa K, et al. Unsupervised spam
DOI: 10.1038/s41524-020-00440-1 detection based on string alienness
4. Kirklin S, Saal JE, Meredig B, Thompson measures. in Discovery Science: 10th
A, Doak JW, Aykol M, et al. The open International Conference. Proceedings, DS
quantum materials database (OQMD): 2007 Sendai, Japan. Springer. 2007;10.
assessing the accuracy of DFT formation 15. Sasaki M, Shinnou H. Spam detection
energies. npj Comp Mater. 2015;1(1):1-15. using text clustering. In: International
DOI: 10.1038/npjcompumats.2015.10 Conference on Cyberworlds (CW’05). Vol.
5. Jain A, Ong SP, Hautier G, Chen W, 2005. IEEE Publications; 2005.
Richards WD, Dacek S, et al. DOI: 10.1109/CW.2005.83
Commentary: The materials project: A 16. Kruschke JK, Liddell TM. Bayesian data
materials genome approach to analysis for newcomers. Psychon Bull Rev.
accelerating materials innovation. APL 2018;25(1):155-77.
Mater. 2013;1(1):011002. DOI: 10.3758/s13423-017-1272-1, PMID
DOI: 10.1063/1.4812323 28405907.
6. Alghoul A, et al. Email classification using 17. Adewole KS, Anuar NB, Kamsin A,
artificial neural network; 2018. Varathan KD, Razak SA. Malicious
7. Udayakumar N, Anandaselvi S, accounts: Dark of the social networks. J
Subbulakshmi T. Dynamic malware Netw Comput Appl. 2017;79:41-67.
analysis using machine learning algorithm. DOI: 10.1016/j.jnca.2016.11.030
In: International Conference on Intelligent 18. Zhuang L, et al. Characterizing botnets
Sustainable Systems (ICISS). Vol. 2017. from email spam records. Leet. 2008;
IEEE Publications; 2017. 8(1):1-9.
DOI: 10.1109/ISS1.2017.8389286 19. Barushka A, Hájek P. Spam filtering using
8. Olatunji SO. Extreme learning machines regularized neural networks with rectified
and support vector machines models for linear units. In: AI* IA 2016 advances in
email spam detection. In: 30th Canadian artificial intelligence. XVth International
Conference on Electrical and Computer Conference of the Italian Association for
Engineering (CCECE). IEEE Publications. Artificial Intelligence, Genova, Italy,
IEEE Publications; 2017. November 29 - December 1, 2016,
DOI: 10.1109/CCECE.2017.7946806 proceedings. Springer; 2016;XV:
9. Dou Y, Ma G, Yu PS, Xie S. Robust 65-75.
spammer detection by nash reinforcement DOI: 10.1007/978-3-319-49130-1_6
learning. In: Proceedings of the 26th ACM 20. Jamil F, Kahng HK, Kim S, Kim DH.
SIGKDD international conference on Towards secure fitness framework based
knowledge discovery & data mining; on IoT-enabled blockchain network
2020:924-33. integrated with machine learning
DOI: 10.1145/3394486.3403135 algorithms. Sensors (Basel). 2021;
10. Lai G-H, Chen C, Laih C, Chen T. A 21(5):1640.
collaborative anti-spam system. Expert DOI: 10.3390/s21051640, PMID
Syst Appl. 2009;36(3):6645-53. 33652773.
DOI: 10.1016/j.eswa.2008.08.075 21. Arif MH, Li J, Iqbal M, Liu K. Sentiment
11. Dean J. Large scale deep learning. In: analysis and spam detection in short
Keynote GPU Technical Conference; informal text using learning classifier
2015. systems. Soft Comput. 2018;22(21):
12. Chiu Y-F, Chen C, Jeng B, Lin H. An 7281-91.
alliance-based anti-spam approach. In: DOI: 10.1007/s00500-017-2729-x
Third International Conference on Natural 22. Zheng X, Zhang X, Yu Y, Kechadi T, Rong
Computation (ICNC 2007). IEEE C. ELM-based spammer detection in social
Publications; 2007. networks. J Supercomput. 2016;72(8):
DOI: 10.1109/ICNC.2007.173 2991-3005.
13. Smadi S, Aslam N, Zhang L. Detection of DOI: 10.1007/s11227-015-1437-5
online phishing email using dynamic 23. Cresci S, Petrocchi M, Spognardi A,
evolving neural network based on Tognazzi S. On the capability of evolved
176
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
177
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
43. Kotsiantis SB, Zaharakis I, Pintelas P. 54. Hsiao W-F, Chang T-M. An incremental
Supervised machine learning: A review of cluster-based approach to spam filtering.
classification techniques. Emerg Artif Intell Expert Syst Appl. 2008;34(3):1599-608.
Appl Comput Eng. 2007;160(1):3-24. DOI: 10.1016/j.eswa.2007.01.018
44. Jiang S, Pang G, Wu M, Kuang L. An 55. Ahuja R, Chug A, Gupta S, Ahuja P, Kohli
improved K-nearest-neighbor algorithm for S. Classification and clustering algorithms
text categorization. Expert Syst Appl. of machine learning with their applications.
2012;39(1):1503-9. In: Nature-inspired computation in data
DOI: 10.1016/j.eswa.2011.08.040 mining and machine learning; 2020. p.
45. Fine S, Singer Y, Tishby N. The 225-48.
hierarchical hidden markov model: DOI: 10.1007/978-3-030-28553-1_11
Analysis and applications. Mach Learn. 56. Li W, Meng W, Tan Z, Xiang Y. Design of
1998;32(1):41-62. multi-view based email classification for
DOI: 10.1023/A:1007469218079 IoT systems via semi-supervised learning.
J Netw Comput Appl. 2019;128:56-63.
46. Abe N, Warmuth MK. On the
DOI: 10.1016/j.jnca.2018.12.002
computational complexity of approximating
distributions by probabilistic automata. 57. Diale M, Celik T, Van Der Walt C.
Mach Learn. 1992;9(2-3):205-60. Unsupervised feature learning for spam
email filtering. Comput Electr Eng.
DOI: 10.1007/BF00992677
2019;74:89-104.
47. Baldi P, Chauvin Y, Hunkapiller T, McClure DOI: 10.1016/j.compeleceng.2019.01.004
MA. Hidden Markov models of biological 58. Peng W, Huang L, Jia J, Ingram E.
primary sequence information. Proc Natl Enhancing the naive Bayes spam filter
Acad Sci U S A (USA). 1994;91(3):1059- through intelligent text modification
63. detection. In: 17th IEEE International
DOI: 10.1073/pnas.91.3.1059, PMID Conference On Trust, Security And Privacy
8302831. In Computing And Communications.
48. Bengio Y, Frasconi P. An input-output 2018;2018.
HMM architecture. In: Tesauro G, DOI:10.1109/TrustCom/BigDataSE.2018.0
Touretzky DS, Leen TK, editors. Advances 0122
in neural information processing systems. 59. Zeng Z, et al. Spammer Detection on
Cambridge, MA: MIT Press; 1995. Weibo Social Network. In: IEEE 6th
49. Gat I, Tishby N, Abeles M. Hidden Markov International Conference on Cloud
modeling of simultaneously recorded cells Computing Technology and Science.
in the associative cortex of behaving 2014;2014.
monkeys. Netw Comput Neural Syst. 60. Lei K, Liu Y, Zhong S, Liu Y, Xu K, Shen Y,
1997;8. et al. Understanding user behavior in Sina
50. Cover T, Thomas J. Elements of Weibo online social network: A community
information theory. Wiley; 1991. approach. IEEE Access. 2018;6:13302-16.
51. Ahmed AH, Mikki M. Improved spam DOI: 10.1109/ACCESS.2018.2808158
detection using DBSCAN and advanced 61. Lin C, He J, Zhou Y, Yang X, Chen K,
digest algorithm. Int J Comput Appl. Song L. Analysis and identification of
2013;69(25):11-6. spamming behaviors in Sina Weibo
DOI: 10.5120/12126-8300 microblog. In: Proceedings of the 7th
52. Tan E, Guo L, Chen S, Zhang X, Zhao Y. workshop on social network mining and
Unik: unsupervised social network spam analysis. Chicago: Association for
detection. In: Proceedings of the 22nd Computing Machinery. 2013:Article 5.
ACM international conference on DOI: 10.1145/2501025.2501035
information & knowledge management; 62. Rusland, N.F., et al. Analysis of naïve
2013:479-88. Bayes algorithm for Email spam filtering
DOI: 10.1145/2505515.2505581 across multiple datasets. IOP Conf S
53. Sharma A, Rastogi V. Spam filtering using Mater Sci Eng. 2017;226(1):012091.
K mean clustering with local feature 63. Singh A, Batra S. Ensemble based spam
selection classifier. Int J Comput Appl. detection in social IoT using probabilistic
2014;108(10):35-9. data structures. Future Gener Comput
Syst. 2018;81:359-71.
DOI: 10.5120/18951-0096
DOI: 10.1016/j.future.2017.09.072
178
Nallamothu and Khan; Asian J. Adv. Res., vol. 6, Issue 1, Page 167-179, 2023; Article no.AJOAIR.2465
64. Xu H, Sun W, Javaid A. Efficient spam 70. Zhou D, Burges CJC, Tao T. Transductive
detection across Online Social Networks. link spam detection. In: Proceedings of the
In: IEEE International Conference on Big 3rd international workshop on adversarial
Data Analysis (ICBDA). 2016;2016. information retrieval on the web. 2007;21-
DOI: 10.1109/ICBDA.2016.7509829 8.
65. Faris H, Aljarah I, Alqatawna J. Optimizing DOI: 10.1145/1244408.1244413
feedforward neural networks using Krill 71. Geng GG, Li Q, Zhang X. Link based small
Herd algorithm for E-mail spam detection. sample learning for web spam detection.
In: IEEE Jordan Conference on Applied In: Proceedings of the 18th international
Electrical Engineering and Computing conference on world wide web. 2009;
Technologies (AEECT). 2015;2015. 1185-6.
DOI: 10.1109/AEECT.2015.7360576 DOI: 10.1145/1526709.1526920
66. Opera: state of the mobile [web]. 72. Wu Y-S, Bagchi S, Singh N, Wita R. Spam
Available:http://www.opera.com/smw/2009/ detection in voice-over-ip calls through
12 semi-supervised clustering. In:
67. Sahami M, Dumais S, Heckerman D, Proceedings of the. Dependable systems
Horvitz E. A bayesian approach to filtering networks. 2009;307-16.
junk e-mail. In: AAAI Workshop on DOI: 10.1109/DSN.2009.5270323
Learning for Text Categorization; 1998. 73. Benevenuto F, Rodrigues T, Almeida V,
68. Gyöngyi Z, Garcia-Molina H, Pedersen J. Almeida J, Gonçalves M. Detecting
Combating web spam with trustrank. In: spammers and content promoters in online
Proceedings of the thirtieth international video social networks. In: Proceedings of
conference on very large data bases. the 32nd international ACM SIGIR
2004;576-87. conference. 2009;620-7.
69. Gyongyi Z, Berkhin P, Garcia-Molina H, DOI: 10.1145/1571941.1572047
Pedersen J. Link spam detection based on 74. Krishnamurthy B, Gill P, Arlitt M. A few
mass estimation. In: VLDB 2006. chirps about twitter. In: Proceedings of the
Proceedings of the 32nd international first workshop on online social networks.
conference on very large data bases. 2008;19-24.
2006;439-50. DOI: 10.1145/1397735.1397741
179