
Survey on Spam Filtering in Text Analysis

Saksham Sharma, Rabi Raj Yadav


Computer Science Engineering, Vellore Institute of Technology
saksham.sharma2016@vitstudent.ac.in
rabiraj.yadav2017@vitstudent.ac.in

Abstract: The Internet has become a powerful platform for global communication and collaboration. Its misuse by fraudsters (for example, through spam or viruses) has made it difficult to build systems that ensure a fair and secure communication experience. Individuals and organisations exploit data to gain competitive advantage, and a substantial amount of data is generated by spam or fake users. Attackers therefore focus on increasingly reliable attack vectors such as email: victims are infected through either malicious attachments or links leading to malicious sites, as seen in Facebook and Twitter posts and Gmail emails. Efficient filtering and blocking strategies for spam messages are therefore required. Unfortunately, most spam filtering solutions proposed so far are reactive: they require a large number of both ham and spam messages to generate rules that effectively separate the two. In this survey paper, we study a more proactive methodology that enables us to collect spam messages directly. We look at how spam text is increasing in emails, fake social media posts, tweets, URLs, etc. Current spam campaigns can be observed and copies of the latest spam messages obtained quickly and efficiently. From the collected information, templates can be generated that concisely summarise a spam campaign. The collected information can then be used to improve current spam filtering methods and to develop new ways of filtering email efficiently.

Keywords: Text Analysis, Spam Filters, Spam Controlling Algorithms, Naive Bayes, Classifier.

I. INTRODUCTION

Electronic mail (email) is critical for group communication and has become widely used by many individuals and organisations. At the same time, unsolicited email, known as spam, is one of the fastest growing and most costly problems associated with the Internet today. The first spam email was sent on May 1, 1978 to many users on ARPANET. It was an advertisement for a presentation by Digital Equipment Corporation of their DECSYSTEM-20 products, sent by Gary Thuerk, one of their marketers [7]. Email spam has grown steadily since then; by 2019, spam messages were estimated to account for 56 percent of email traffic [8]. Spam filtering is the process of detecting spam text and unsolicited, unwanted emails, and preventing those messages from reaching a user's inbox. Essentially, it is a software routine that removes incoming spam or redirects it to a "junk" mailbox (the spam folder). Also called "spam blockers," spam filters are built into a user's email program. They are also built into or added onto a mail server, in which case the spam never reaches the user's mailbox.

To avoid spam emails, fake social media posts and similar content, authors have proposed various techniques that address common problems in spam filtering using natural language processing (NLP). NLP is a branch of artificial intelligence that deals with the interaction between computers and humans. The objective of NLP is to read, interpret and understand human languages in a way that is useful. Today, efficient spam filtering is done by combining various classifiers supported by advanced filter platforms such as SpamAssassin (The Apache SpamAssassin Project, 2011) or Wirebrush4SPAM (Wirebrush4SPAM, 2011). A typical filter combines four different kinds of techniques: (i) domain authentication schemes, (ii) collaborative approaches, (iii) content-based classifiers, and (iv) characteristics-based filters [9]. One novel methodology distinguishes spam from non-spam social media posts and offers more insight into the behaviour of spam users on Twitter. It proposes an optimized set of features that is independent of historical tweets, which are available for only a brief time on Twitter. Suitable algorithms such as Naive Bayes and SMO are then applied to the filters, following data collection and data pre-processing [1].

II. PROBLEM DESCRIPTION

The most common problem seen across the surveyed papers lies in the spam filters themselves. A spam filter misclassifying a legitimate e-mail as spam is generally more damaging than misclassifying a spam e-mail as legitimate [2].

Online spamming activities come in different forms, for example malware distribution, posting of commercial URLs, fake news or abusive content, automated generation of large volumes of content, and following or mentioning random users [1]. Spam filters are highly sought after to automatically filter spam and keep mailboxes clean. Early spam filters required a user to create rules, mainly by looking at patterns in typical junk email, such as the presence of specific words, combinations of words, or phrases. To tackle this problem, authors have proposed a hybrid SMS classification system that detects spam or ham using the Naive Bayes algorithm [3] and the Apriori algorithm. Also, on social media, spammers use URLs in tweets to
divert users to malicious sites that host viruses. They also use URLs for phishing, to obtain users' personal details.

III. CRITICAL ANALYSIS

In this paper [1], to address the issue of skewed class distribution, the authors propose a novel term frequency difference and category ratio based feature selection function, named TFDCR, for selecting features with strong discriminating ability from the training data. They use an incremental learning model based on Support Vector Machines (SVM), so that the learned decision model is updated to follow the changed data distribution in the presence of drifting concepts. Because the content of e-mails changes frequently, the relevance of the representative features also varies over time [4]. They propose a novel selectionRankWeight heuristic function, based on a feature's category ratio difference, to identify new features from an incoming set of e-mails. The existing feature set is updated with these newly selected features before incremental training of the classifier is triggered.

Architecture: [figure]

Working:

Proposed Algorithms-
The following approaches are proposed:
Feature selection using the novel TFDCR function.
- This algorithm gives the method for the TFDCR feature selection function.
Incremental personalized e-mail spam filter with dynamic feature update.
- This algorithm describes the proposed system for incremental personalized e-mail spam filtering with the dynamic feature update function.

The first experiment analyses how effectively the TFDCR feature selection function identifies the most discriminating features from the given training data. The authors use five basic and well-known feature selection functions to compare against the results achieved with the proposed TFDCR. This first experiment uses conventional batch training of the classifier followed by a testing phase.

In the second experiment, an incremental learning model using SVM with an updated feature set is integrated to relearn the changed data distribution [5]. In this experiment, TFDCR feature selection is used in both the conventional batch learning and the incremental learning models. Features are updated using the selectionRankWeight heuristic function.

The datasets used are as follows. The first is the open ENRON dataset, which contains pre-processed e-mail messages with attachments removed. The second dataset used for filter evaluation is ECML. The ECML task-A and task-B datasets were made publicly available during the 2006 ECML-PKDD Discovery Challenge [6].

The third dataset is the PU dataset, which contains four folders, PU1, PU2, PU3 and PUA, each containing e-mails received by a particular user.

Another paper proposes a novel methodology for distinguishing spam from non-spam social media posts and offers more insight into the behaviour of spam users on Twitter. The methodology proposes an optimized set of features that is independent of historical tweets, which are available for only a brief time on Twitter [2]. It takes into account features related to Twitter users, their accounts and their pairwise engagement with each other. The authors experimentally demonstrate the efficacy and robustness of the methodology and compare it to a typical feature set for spam detection in the literature, achieving a significant improvement in performance [4]. In contrast to earlier research findings, they observe that an average automated spam account posted at least 12 tweets per day at well-defined periods.

Parameter tuning and classification models
- An effective classifier should be able to correctly classify previously unseen data by leveraging the experience gained from training on n labelled samples, i.e. data instances and their corresponding class.

Feature importance and correlation
- During an initial investigation stage, a large number of features were used for training, and some features were discarded because of their relatively low contribution to overall performance.
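The feature screening step described above can be illustrated with a small sketch. The survey does not say how feature contribution was measured, so the code below assumes a simple stand-in: rank each feature by the absolute Pearson correlation between its values and the spam label, and drop features below a threshold. The function names, the threshold `min_abs_corr` and the toy feature values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def screen_features(columns, labels, min_abs_corr=0.3):
    """Keep features whose |correlation| with the label reaches the threshold.

    This is an assumed screening rule for illustration, not the
    procedure used in the surveyed paper.
    """
    scores = {name: pearson(values, labels) for name, values in columns.items()}
    kept = [name for name, score in scores.items() if abs(score) >= min_abs_corr]
    return kept, scores


# Toy data (made up): 1 = spam account, 0 = legitimate account.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "tweets_per_day": [14, 12, 20, 3, 5, 2],
    "profile_length": [40, 10, 80, 45, 12, 75],
    "is_verified": [0, 0, 0, 0, 0, 0],
}
kept, scores = screen_features(columns, labels)
print(kept)
```

On this toy data only the posting-rate feature survives, since the profile length is nearly uncorrelated with the label and the constant column scores zero. A real study would more likely rely on classifier-based importance measures and a held-out evaluation.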
Tools that are required:
The scikit-learn toolkit is used. For evaluation, different metrics are used so as to avoid a bias towards the majority class, especially when the dataset is imbalanced.

The Honeypot dataset is openly available and useful for examining spam activity on Twitter. It was used both as a dataset per se and for collecting the SPD datasets using keywords. The SPD datasets are used to avoid the potential risk of a high false positive rate.

Architecture: [figure]

In this paper [3], the authors propose a solution based on data mining techniques, in the following steps.

Data Collection: The data were donated by George Forman, and the collected data set is available in the UCI machine learning repository.

Data Pre-processing: A real-world data set contains numerous errors; these are cleaned and removed so as to obtain accurate results. In this step the data set is transformed and integrated into an appropriate format before classifiers are applied to it.

Applying Algorithm: Once the pre-processed file is available, all the classification algorithms, namely Naive Bayes, SMO, J48 decision tree, and random forest, are applied to discover the features on which spam is identified.

Performance Evaluation: After applying all classifiers, each of them is evaluated on performance metrics so as to determine the best classifier.

The authors state that, unlike classification, other data mining techniques such as clustering and association are not capable of handling prediction-related problems. Clustering partitions related data elements into the same set, whereas association is generally used for establishing relationships between attributes that exist in a data set.

Working:
1. Naive Bayes algorithm
2. Apriori algorithm
3. J48 decision tree
4. Random forest algorithm

The data set used in this work is tested and analysed with four different classification techniques using cross-validation: (i) Naive Bayes, (ii) SMO, (iii) J48, and (iv) random forest. After applying each classifier, its performance is analysed on different performance metrics: true-positive rate, false-positive rate, precision, recall, F-measure, ROC area, and time taken to build the classifier model [5].

Tools and Technology: To implement the required task, the authors used WEKA, an open-source tool for data mining. The classifiers used in this work are: (i) Naive Bayes, (ii) SMO, (iii) J48, and (iv) random forest.

Architecture: [figure]

In this paper [4], the authors introduce deep learning models based on convolutional neural networks (CNNs). Five CNNs and one feature-based model are used in the ensemble. Each CNN uses different word embeddings (GloVe, Word2vec) to train the model. The feature-based model uses content-based, user-based, and n-gram features. A neural network acts as a meta-classifier. The difference between the proposed method and existing methods is that it combines both handcrafted
features and word embedding features to capture more information about spam and non-spam tweets.

Working:

User-Based Features: These check whether the user profile is verified, the length of the user profile description, whether location information is given by the user, the number of followers and friends of the user, the reputation score (#followers / (#followers + #friends)), the number of tweets posted, and the lists that the user has subscribed to [1].

Content-Based Features: These count the words, capitalized words, exclamation and question marks, URLs, hashtags, and mentions in the tweet.

Tools that are required:
They used the HSpam and 1KS10KN datasets. Along with the feature model they used four methods (CNN + Twitter GloVe, CNN + Google News, CNN + Edinburgh, CNN + HSpam). They used accuracy, precision, and F-measure as the parameters for comparing the word embeddings.

In this paper [5], the authors introduce a feature selection technique for the spam-filtering domain that takes advantage of semantic information. They group word-based features into semantic topics to generate feature vectors. The study involves three feature selection methods: Information Gain, Latent Dirichlet Allocation, and Semantic-based Feature Selection.

Working:

They discuss four basic methods:
(i) content-based filtering: a set of methods able to perform a detailed analysis of message content (text, image/s, attached documents) to determine a class for the message.
(ii) collaborative schemes: sharing detailed information about received spam messages.
(iii) domain authorization methods: defining trusted servers (identified by their IP addresses) that may send messages for a certain domain.
(iv) characteristics-based filters: for example, the number of recipients receiving the same e-mail.

Topic guessing methodology: They used the noun semantic relations included in WordNet to design it:
(i) hyponym (X is a kind of Y)
(ii) hypernym (X includes the notion of Y among others)
(iii) meronym (X is a part of Y)
(iv) holonym (X contains Y among others).
They find e-mail topics by selecting a hierarchical level (h) in order to semantically group terms (synsets) into more generic topics. A topic is present in a message if it contains a term t belonging to one of the representative synsets for the topic.

Tools that are required:
They used the WordNet lexical database. The hierarchical WordNet database groups words into synsets.

In this paper [6], the authors mainly aim to increase the accuracy of the existing Naive Bayes spam filter. Their algorithm targets text modifications that hinder the classification of e-mail. Spam senders are commonly able to bypass spam detectors by using leetspeak and diacritics. Leetspeak is an alternative alphabet that is primarily used on the Internet. Diacritics are the accents placed on letters to modify their appearance. Leetspeak allows spam senders to change letters into a symbol or a series of symbols. For example, "A" can be written as "/-\". When a word is modified using leetspeak, spam detectors are not able to identify the e-mail as spam, which creates a false negative.

Bayesian poisoning is a technique used by e-mail spammers to attempt to degrade the effectiveness of spam filters that rely on Bayesian spam filtering.

Working:

They first apply a pre-processing technique to remove the diacritics and leetspeak. In Python, the isalpha() function is used with the algorithm below:
If a = leetspeak: then replace(a, c)
If b = diacritics: then replace(b, c)

After this, they apply the multinomial Naive Bayes algorithm along with different machine learning algorithms.

Multinomial Naive Bayes (MNB) models the probability of the words (t_k) within a message d given the class c of the message, spam or ham. It assumes that the message is a bag of tokens or words, such that the order of the tokens is irrelevant. Multinomial Naive Bayes essentially counts the relative occurrences of a particular token within messages to determine the conditional probability:
$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$$
where n_d is the number of tokens in d.

Tools that are required:
They used the SpamAssassin spam server for the datasets.

CONCLUSION AND FUTURE WORK

This study critically analyses the different approaches in the different papers, ranging from basic text pre-processing, through the Naive Bayes classifier, to the advanced concepts of neural networks and feature selection. Across all the papers, we found that Naive Bayes classification is the oldest
approach [4]. After that, feature selection came into prominence. In the present age of machine learning and deep learning, one paper introduced the application of neural networks. Different papers used different datasets, and for evaluation they used different metrics such as precision, recall and the confusion matrix. The confusion matrix consists of (i) false positive errors (FP, legitimate messages classified as spam), (ii) false negative errors (FN, undetected spam e-mails), (iii) true positive hits (TP, number of spam messages detected) and (iv) true negative hits (TN, number of legitimate messages correctly classified) [2].
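The metrics mentioned above follow directly from the four confusion-matrix counts. The sketch below shows the standard definitions with spam as the positive class; the function name `spam_metrics` and the example counts are made up for illustration and do not come from any of the surveyed papers.

```python
def spam_metrics(tp, fp, tn, fn):
    """Derive common evaluation metrics from confusion-matrix counts.

    Spam is treated as the positive class, matching the TP/FP/TN/FN
    definitions given in the text.
    """
    def ratio(num, den):
        # Guard against empty denominators (e.g. no predicted spam at all).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Made-up counts for illustration only.
metrics = spam_metrics(tp=90, fp=5, tn=895, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because a legitimate message classified as spam is usually the more costly error, the false positive rate is worth reporting alongside precision and recall rather than relying on accuracy alone.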
Although the work done by the above authors seems promising, further work can be done. Among all the above papers, the topic guessing model achieves the best results when compared with the other alternatives. An improvement can be made by reducing obfuscation (the destruction of the intended meaning of communication by making the message difficult to understand). This will help in guessing the topic precisely, to classify an e-mail as ham or spam.
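One concrete way to reduce obfuscation is to map leetspeak symbols and accented characters back to plain letters before classification, in the spirit of the pre-processing described for [6]. The following is a minimal sketch, not the authors' implementation: the substitution table `LEET_MAP` is a small hypothetical sample, and the helper names are chosen for illustration.

```python
import unicodedata

# Hypothetical substitution table; a real filter would need a far larger one.
# The multi-character symbol is listed first so it is replaced as a unit.
LEET_MAP = [
    ("/-\\", "a"), ("@", "a"), ("4", "a"), ("3", "e"),
    ("1", "i"), ("0", "o"), ("$", "s"), ("5", "s"), ("7", "t"),
]


def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_token(token):
    """Return a lower-case token with diacritics and leetspeak undone."""
    token = strip_diacritics(token).lower()
    # Plain words and pure numbers are left alone; str.isalpha() is the
    # check mentioned in the text for spotting unmodified words.
    if token.isalpha() or token.isdigit():
        return token
    for symbol, letter in LEET_MAP:
        token = token.replace(symbol, letter)
    return token


def normalize_message(message):
    """Normalize every whitespace-separated token of a message."""
    return " ".join(normalize_token(tok) for tok in message.split())


print(normalize_message("FR33 c@sh"))
```

Diacritics are stripped before the isalpha() check because accented letters also count as alphabetic in Python. Tokens that legitimately mix letters and digits would be altered by this table, so a practical system would need a dictionary check or similar safeguard.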
Further, this work can be extended to spam detection in image and video content, such as instaspam, as the world grows into a digital market [5].

REFERENCES
[1] Sanghani, G., & Kotecha, K. (2019). Incremental personalized E-mail spam filter using novel TFDCR feature selection with dynamic feature update. Expert Systems with Applications, 115, 287–299.
[2] Inuwa-Dutse, I., Liptrott, M., & Korkontzelos, I. (2018). Detection of spam-posting accounts on Twitter.
[3] Satapathy, S. C., Bhateja, V., & Das, S. (Eds.). (2019). Smart Intelligent Computing and Applications. Smart Innovation, Systems and Technologies. doi:10.1007/978-981-13-1921-1
[4] Madisetty, S., & Desarkar, M. S. (2018). A Neural Network-Based Ensemble Approach for Spam Detection in Twitter. IEEE Transactions on Computational Social Systems, 1–12.
[5] Méndez, J. R., Cotos-Yañez, T. R., & Ruano-Ordás, D. (2019). A new semantic-based feature selection method for spam filtering.
[6] Peng, W., Huang, L., Jia, J., & Ingram, E. (2018). Enhancing the Naive Bayes Spam Filter Through Intelligent Text Modification Detection. 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications / 12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE).
[7] History of Spam. Mailmsg.com. Archived from the original on 26 March 2006. Retrieved 11 July 2006. https://web.archive.org/web/20060326032433/http://www.mailmsg.com/SPAM_history.htm
[8] Global spam volume as percentage of total e-mail traffic by month. https://www.statista.com/statistics/420391/spam-email-traffic-share/
[9] Pérez-Diaz, N., Ruano-Ordás, D., Fdez-Riverola, F., & Méndez, J. R. (2012). SDAI: An integral evaluation methodology for content-based spam filtering models. Expert Systems with Applications, 39, 12487–12500. http://dx.doi.org/10.1016/j.eswa.2012.04.064
