
SPAM FILTERING USING MACHINE LEARNING
MINI PROJECT REPORT
Submitted by
THIRUKUMARAN.S (171CS295)
PRAVEENA.S (171CS225)
NIVASRI.J (171CS211)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BANNARI AMMAN INSTITUTE OF TECHNOLOGY
(An Autonomous Institution Affiliated to Anna University, Chennai)
SATHYAMANGALAM-638 401
NOVEMBER 2018
BONAFIDE CERTIFICATE
Certified that this project report “SPAM FILTERING USING MACHINE LEARNING
TECHNIQUE” is the bonafide work of “S. THIRUKUMARAN (171CS295),
J. NIVASRI (171CS211), S. PRAVEENA (171CS225)” who carried out the mini
project work under my supervision.
SIGNATURE                          SIGNATURE
Dr. S. LOGESHWARI                  M. SURIYA
HEAD OF THE DEPARTMENT             SUPERVISOR
DEPARTMENT OF COMPUTER             DEPARTMENT OF COMPUTER
SCIENCE AND ENGINEERING            SCIENCE AND ENGINEERING
Submitted for Viva Voce examination held on ________________
INTERNAL EXAMINER                  EXTERNAL EXAMINER

BANNARI AMMAN INSTITUTE OF TECHNOLOGY
An Autonomous Institution Affiliated to Anna University – Chennai | Approved by AICTE
SATHYAMANGALAM – 638 401
DECLARATION FORM
We hereby declare that the project report entitled “Spam Filtering Using Machine
Learning” submitted in partial fulfillment of the requirements for the award
of the Degree of Bachelor / Master of Engineering / Technology / Business Administration in
Computer Science and Engineering is a record of original work done by us and the
report is in compliance with the “Guidelines for plagiarism checking and submitting
plagiarism-free UG/PG project reports” currently in force.
We also declare that the contents of the plagiarism analysis report and the submitted
project report are identical.
Project Members
S. No.   Roll No.    Name               Signature
1        171CS295    S. THIRUKUMARAN
2        171CS211    J. NIVASRI
3        171CS225    S. PRAVEENA
Supervisor name:
Designation & department: Assistant Professor, Department of Computer Science and Engineering
Signature:
TABLE OF CONTENTS
CHAPTER NO. TITLE
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
1.1 GENERAL AIM
1.2 Definition and terminologies
1.2.1 SMS over e-mail
1.3 Project motivation
1.4 Spam words
1.5 Objective
2. LITERATURE SURVEY
2.1 SOLUTION BY DIXIT
2.2 FREQUENCY ANALYSIS IN 2010
2.3 SOLUTION TO POINT-TO-POINT MESSAGES
2.4 FILTERING OTHER TEXT CLASSIFICATION DOMAIN
2.5 SMS SPAM DATA
2.5.1 Analysis of an SMS spam corpus
3. FROM MAIL TO MESSAGE FILTERING
3.1 SIMILARITIES IN TECHNIQUES USED IN MAIL AND SMS
3.2 DIRECT FILTERING TECHNIQUE
3.2.1 Automatic text classification
3.3 COLLABORATIVE FILTERING
3.3.1 Technical issues
3.3.2 Message length
3.4 CLIENT SIDE SOLUTIONS TO SPAM FILTERING
4. CONTENT BASED SMS SPAM FILTERING
4.1 WORK PROPOSING THE APPLICATION OF AUTOMATIC TEXT CLASSIFICATION
4.1.1 Spam data set
4.2 K-NEAREST NEIGHBOR ALGORITHM
4.2.1 Dual filtering method
4.2.2 Evolutionary classifiers
4.3 UCS-SUPERVISED CLASSIFIER SYSTEM
4.3.1 Comparison of supervised learning algorithms
4.3.2 The thirteen classifiers
4.3.3 Three top-ranked algorithms
4.4 CHALLENGE RESPONSE SYSTEM
4.4.1 Limitations to the challenge response system
4.4.2 Solutions installed on client’s mobile device
4.5 FEATURE ENGINEERING IN SMS SPAM
4.5.1 Feature set
4.5.2 Stylistic features
5. CONCLUSION
5.1 MULTILINGUAL ENVIRONMENTS
5.2 HYBRID SOLUTIONS
5.3 ADVANCED ADDRESS-BASED FILTERING
5.4 SCALABILITY AND REAL-WORLD DEPLOYMENT
5.5 INDUSTRY COLLABORATION
APPENDIX
Appendix 1 (Screenshots)
Appendix 2 (Source code)
REFERENCES
ABSTRACT
The growth in the number of mobile phone users has led to a dramatic increase in
SMS spam messages. In practice, fighting mobile phone spam is made difficult by
several factors, including the lower rate of SMS spam, which has allowed many users
and service providers to ignore the issue, and the limited availability of mobile
phone spam-filtering software. In academic settings, on the other hand, a major
handicap is the scarcity of public SMS spam datasets, which are sorely needed for
the validation and comparison of different classifiers. Moreover, as SMS messages
are fairly short, content-based spam filters may have their performance degraded.
In this report, we offer a new real, public and non-encoded SMS spam collection
that is the largest one as far as we know. The analysis of results indicates that
the procedure followed does not lead to near-duplicates and, consequently, the
proposed dataset is reliable to use for evaluating and comparing the performance
achieved by different classifiers. In addition, we compare the performance achieved
by several established machine learning methods. The results indicate that the
Support Vector Machine outperforms the other evaluated classifiers and, hence, it
can be used as a good baseline for further comparison.
Keywords: Spam filtering, Text classification, SMS spam dataset

LIST OF TABLES
TABLE NO. TITLE
1.1 Project Motivation
1.2 List of Spam words
2.1 Analysis of an SMS spam corpus
LIST OF FIGURES
FIGURE NO. TITLE
1.1 Sublime IDE - I
1.2 Sublime IDE - II
1.3 Sublime IDE - III
1.4 Python IDE - I
1.5 Python IDE - II
CHAPTER 1
INTRODUCTION
1.1 General aim:
The general aim of the project is to block spam messages, which is the specific
purpose of the system. Spam messages should be separated out so that genuine
messages stand out and arrive safely in the inbox.
1.2 Definition and terminologies:
Spam consists of unsolicited and unwanted messages. SMS spam is typically
transmitted over a mobile network. Traditional spammers are moving to mobile
networks as the return from the email channel is decreasing as a result of
effective filtering, industry collaboration and user awareness.
1.2.1 SMS over e-mail:
The Short Messaging Service (SMS) mobile communication system is attractive to
criminal gangs for several reasons. It is becoming cost effective to target SMS
because of the availability of unlimited pre-pay SMS packages in many countries.
In addition, SMS can result in higher response rates than email spam, as SMS is a
trusted service and subscribers are comfortable using it for private information
exchange.
1.3 Project motivation:
The primary motivation for this project lies in the growth in the number of text
messages sent across the UK during the years 2000-2008, as reported by the
Mobile Data Association. Table 1.1 lists the figures for 2005 to 2008.
Year No. of text messages
2005 2.5 billion
2006 3.5 billion
2007 4.4 billion
2008 6.3 billion
Table 1.1: Project Motivation
1.4 Spam words:
The simplest way to keep spam messages out of the user's inbox is to check incoming
messages against a list of the most common spam words. Several such words and
phrases occur regularly in spam, and matching on them gives the quickest filtering
procedure for getting relief from spam messages.
S.no Spam words
1 You won!!!
2 Enjoy!!
3 Here's an opportunity
4 Free!!
5 You earn
6 Free tickets
7 Surprise
8 No credit check
9 Congratulations!!!
10 Free trial
11 Offers!!!
12 Money back offers
13 4U
14 Billion Dollars
15 Don’t Hesitate!!!
16 Time to get gifts
17 Earn $
18 $$$$
19 Free calls
20 Act now
21 Apply now
22 Won!!!
23 Expensive gifts
24 Special offer!!
25 Offers for the Auspicious day
Table 1.2: List of spam words
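The word-list check described in this section can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the `min_hits` parameter and the shortened phrase list are assumptions introduced here, with the phrases taken from Table 1.2 and lowercased.

```python
import re

# Hypothetical subset of the phrases in Table 1.2, lowercased for matching.
SPAM_PHRASES = ["you won", "free", "act now", "apply now",
                "special offer", "money back", "4u"]


def spam_phrase_hits(message, phrases=SPAM_PHRASES):
    """Return the listed phrases that occur in the message as whole words."""
    text = message.lower()
    hits = []
    for phrase in phrases:
        # \b anchors keep "free" from matching inside e.g. "freedom".
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            hits.append(phrase)
    return hits


def is_spam_by_keywords(message, min_hits=1):
    """Flag a message once it contains at least min_hits listed phrases."""
    return len(spam_phrase_hits(message)) >= min_hits
```

A fixed word list is fast but easy to evade, which is one reason the later chapters turn to learned classifiers.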
1.5 Objective:
The main objective is to identify unsolicited and unwanted messages and to prevent
them from reaching a user's inbox. Our primary goal is to develop a generalized
model that can classify or filter datasets from different sources with a better
level of accuracy. A further goal is to identify and address the challenges and
choices raised by the collection and use of a message corpus, and to explore
effective approaches for identifying and extracting features for analysis.
CHAPTER 2
LITERATURE SURVEY
2.1 Solution by Dixit:
An early centralized filtering solution was suggested by Dixit. Rather than using
the standard text classification approaches, the messages were represented as a
character-based vector which was projected into a smaller feature space and
clustered to identify clusters of spam and non-spam SMS messages.
New messages are classified based on their distance from the spam and non-spam
clusters. This approach is motivated by the lack of keywords available to the usual
classification algorithms because of the short length of SMS messages; however, the
performance of this approach at classifying spam was not evaluated. The behavior
of senders over time can also be indicative of whether a given message is spam or
not.
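The idea of assigning a new message to the nearest group of character-based vectors can be illustrated with a simplified sketch. It is an assumption-laden stand-in for the approach described above: it uses character bigram counts, labelled group centroids and cosine distance, and it omits the projection into a smaller feature space and the unsupervised clustering step. All function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Character n-gram counts of a lowercased message."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def centroid(vectors):
    """Average a list of sparse count vectors into one sparse vector."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {key: count / len(vectors) for key, count in total.items()}


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_group(message, centroids):
    """Name of the group whose centroid is closest to the message."""
    vec = char_ngrams(message)
    return min(centroids, key=lambda name: cosine_distance(vec, centroids[name]))
```

Character n-grams are used here because, as noted above, short messages offer few whole-word keywords to work with.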
2.2 Frequency analysis in 2010:
Two studies from 2010 both add a frequency analysis of SMS to an existing spam
filter with the goal of improving the central system's real-time processing speed.
By considering the frequency of spam received during different time periods, they
concentrate filtering on specific time periods. Their approach greatly improves the
throughput of the system, but at the cost of a large decrease in spam detection and
a significant rise, to 2.5%, in the false positive rate. Non-content-based
technologies such as social network analysis have become popular in the filtering
area. Network analysis approaches are address-based filtering approaches which aim
to predict whether a sender is a potential spammer or not, and also whether the
message itself is spam or not. There is some evidence that these techniques are
beginning to be used for SMS filtering.
2.3 Solution to point-to-point messages:
One study presented an interesting solution for point-to-point messages, those sent
from one mobile to another, which combines social network analysis with spectral
analysis. The authors generate a directed graph from message logs and suggest two
types of filter, an offline filter and an online filter. The offline filter uses
features from a one-hop social network that models longer-term sender behavior,
while the online filter focuses on how many receivers a sender has sent to in a
given time period, which is extracted from a two-hop social network and combined
with temporal spectral analysis of submission behavior. They suggest that their
approaches can be combined with content-based approaches either in parallel, where
the results of independent filter systems are combined, or sequentially, where the
behavior-based filter provides input to the content-based system or vice versa.
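One ingredient of the online filter, counting how many distinct receivers a sender has contacted in a given time window, can be sketched directly from a message log. The log format `(timestamp_seconds, sender, receiver)`, the function names and the `max_receivers` cut-off are assumptions for illustration; the spectral analysis and the two-hop network features are not modelled.

```python
from collections import defaultdict


def build_sender_graph(log):
    """Directed graph as an adjacency map: sender -> set of receivers."""
    graph = defaultdict(set)
    for _, sender, receiver in log:
        graph[sender].add(receiver)
    return graph


def distinct_receivers(log, sender, start, end):
    """Number of different receivers the sender messaged in [start, end)."""
    return len({r for t, s, r in log if s == sender and start <= t < end})


def flag_bulk_senders(log, start, end, max_receivers):
    """Senders whose receiver count in the window exceeds max_receivers."""
    senders = {s for _, s, _ in log}
    return sorted(s for s in senders
                  if distinct_receivers(log, s, start, end) > max_receivers)
```

Here `log` is expected to be a list, since it is traversed more than once.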
2.4 Filtering Other Text Classification Domain:
There has been other relevant spam classification work recently in related short text
message domains. There is significant evidence of spam in social networks,
including instant messaging and Twitter spam. Typically, fake accounts are used to
automatically send messages that contain links that can be used to gather marketing
information.
2.5 SMS spam data:
Any supervised machine learning approach, such as those mentioned above, is highly
dependent on the quality and quantity of the training data available to it. Spam
filtering using text classification techniques needs representative, timely corpora
of spam and non-spam messages with which to train the algorithms. In the email
world, a number of different corpora are available, including the SpamAssassin
corpus and the TREC email corpora.
2.5.1 Analysis of an SMS spam corpus:
With the goal of analyzing and identifying different categories of spam, we
performed a clustering analysis on the corpus presented in the previous section.
The raw documents were parsed and processed using standard unigram-based text
clustering practices. A stop-list containing 499 entries was used to remove common
function words.
Annotation      Top terms
Ringtones       send, ringtone, text, tone, free, SMS, reply, mobile
Claims          accident, entitled, records, pounds, claim, message, compensation
Competitions    txt, win, voucher, cash, 150p, send, entry
Prizes          prize, guaranteed, urgent, todays, valid, claim, draw, cash
Voicemail       please, message, voicemail, waiting, call, delivery, urgent
Dating          dating, service, contacted, guess, statement, points, private
Services        mins, free, camera, orange, latest, phone, camcorder
Finance         help, debt, credit, info, government, loans, solution
Chat            naughty, ring, alone, chat, xx, heard, luv, home
Miscellaneous   secret, admirer, special, looking, reveal, contact, call
Table 2.1: Analysis of an SMS spam corpus
When we compare these clusters to the types of spam identified by the GSMA, there
is a close correspondence to the three main types, which are described as:
• SMS spam
• Premium rate fraud
• Phishing/Smishing
CHAPTER 3
FROM MAIL TO MESSAGE FILTERING
3.1 Similarities in techniques used in both mail and SMS spam filtering:
The similarity of SMS spam filtering to email spam filtering suggests that proven
advances in email spam filtering may be useful in combating SMS spam. Spam
filtering includes both direct content filtering and collaborative content
filtering techniques.
3.2 Direct filtering technique:
Direct content filtering technologies use the textual content of the message
itself. They range from simple keyword filtering, through the more varied
SpamAssassin-type rule sets, to the more complex automatic text classification
approaches.
3.2.1 Automatic text classification:
Automatic text classification uses supervised machine learning algorithms to train
a model on a set of examples of spam and legitimate messages which are labelled
appropriately. This set is known as the training set and should be representative
of typical spam and legitimate messages. The model learns from this training set
how to distinguish spam from non-spam and is then used to predict whether new
messages are spam or not. Automatic text classification requires a representation
of each message, normally an n-dimensional vector where each dimension represents a
characteristic or feature that is predictive for the text classification problem.
The features are identified by parsing and tokenization of the textual content. A
typical choice is word tokenization, but character-based or word-based n-gram
tokenization is also popular. The value of each feature in the vector
representation of a message is usually indicative of the frequency of occurrence of
that feature in the message.
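The representation described above, tokens turned into an n-dimensional frequency vector, can be sketched in a few lines. This is only an illustration using the Python standard library; the function names are hypothetical and the word pattern (runs of letters and digits) is an assumed tokenization rule.

```python
import re
from collections import Counter


def word_tokens(message):
    """Word tokenization: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", message.lower())


def char_ngrams(message, n=3):
    """Character-based n-gram tokenization of the lowercased message."""
    text = message.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(token_lists):
    """One dimension per distinct token seen in the training messages."""
    return sorted({tok for tokens in token_lists for tok in tokens})


def to_vector(tokens, vocabulary):
    """Frequency of each vocabulary term in one message."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

Swapping `word_tokens` for `char_ngrams` changes the feature space but not the vector construction, which is why the two tokenizations can be compared on the same classifier.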
3.3 Collaborative filtering technique:
Collaborative content filtering techniques allow users to share information on
spam. A successful approach is to generate a signature, sometimes known as a
fingerprint, from the content of a known spam message; this signature is then
distributed to and shared with a community of users.
A signature is generated for every incoming message and checked against the known
spam signatures, and matches are labelled as spam messages. A well-known example is
Vipul's Razor, an undisclosed variation of which is used by the spam filtering
company Cloudmark.
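The signature-matching loop can be sketched as below. This is an assumption for illustration only: it hashes a normalized copy of the message with SHA-256, so it matches only messages that are identical after normalization. It does not reproduce how Vipul's Razor or Cloudmark compute their signatures, and the class and function names are hypothetical.

```python
import hashlib
import re


def signature(message):
    """Fingerprint of a message: SHA-256 of its normalized text."""
    # Lowercase and collapse punctuation/whitespace so trivial edits still match.
    normalized = re.sub(r"[^a-z0-9]+", " ", message.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SharedSpamRegistry:
    """In-memory stand-in for a signature store shared by a user community."""

    def __init__(self):
        self._known = set()

    def report_spam(self, message):
        self._known.add(signature(message))

    def is_known_spam(self, message):
        return signature(message) in self._known
```

An exact hash is brittle against spammers who vary their wording, so a practical system would need a signature that tolerates small changes.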
3.3.1 Technical issues:
Collaborative filtering techniques rely heavily on the quality and volume of user
reporting of spam, which can be difficult in the mobile world, as smart mobile
phones and appropriate software are necessary to support user reporting
functionality. The fact that many of the same issues apply across both filtering
domains supports the use of proven email filtering technologies. Both domains share
the technical issue of filtering efficiently in real time, and both need to decide
between client-side and server-side filtering.
Significantly, the characteristics of email spam filtering that make it a
challenging filtering problem also transfer to the mobile domain. The unequal and
uncertain classification costs, in particular the requirement to avoid false
positives, are as evident in SMS spam filtering as in email spam filtering. In
addition, handling concept drift, the constant change in spam made in order to
bypass filters, is also a key challenge. There is strong evidence of concept drift
in current spam, with spammers using low volumes to avoid volume filters. As spam
becomes more prevalent and filtering becomes more difficult as a result, concept
drift will become a significant problem in spam filtering.
3.3.2 Message length:
For SMS spam, however, a number of additional issues arise regarding the message
itself. The maximum length of an SMS message is 160 characters, which means there
is little material for content-based filtering. Because of the short message
length available, subscribers use an idiosyncratic language subset with
abbreviations, bad punctuation, emoticons, etc.
It has also been shown that spam filtering can be improved by including contextual
information found in the headers, but SMS messages contain far less information in
their headers, which means less context to work with. The mobile technology itself
is also a factor.
3.4 Client-side solutions to spam filtering:
Client-side solutions to spam filtering must operate on resource-constrained mobile
devices. Despite the increasing use of smartphones, phones with only basic voice
call and text functionality are still in the majority, especially in emerging
markets where such phones continue to be launched and sold. Such devices also do
not have the functionality to display a spam folder such as is common with email
clients, so it is more difficult to tell users that messages have been blocked.
Further, mobile devices typically do not facilitate user reporting of spam
messages, unless this service is provided by the network or by a third party, which
makes collaborative content filters, which rely on user feedback, difficult to
implement. Recently there has been research into applying the successful email spam
filtering techniques to SMS spam filtering, with some success. The next chapter
reviews the developments in SMS spam filtering, which tend to focus on the more
popular supervised learning or text classification approaches, and also discusses
research into other types of approach, including frequency analysis and social
network analysis.
CHAPTER 4
CONTENT BASED SMS SPAM FILTERING
4.1 Work proposing the application of automatic text classification:
Early work proposing the application of automatic text classification techniques to
SMS spam filtering includes one study which suggested that Support Vector Machines
would be suitable for the problem but did not evaluate their use, and another which
considered using a k-NN classifier.
4.1.1 Spam data set:
One study evaluated a number of classification algorithms on two SMS spam datasets
and concluded that these techniques can be effectively transferred from email to
SMS spam filtering, with SVMs being the most suitable. Other work on a Chinese spam
dataset used the simpler Winnow algorithm, a linear classifier that has shown good
performance. A further study used a Bayes learner to extract keywords for
monitoring traffic centrally, allowing a spamming score to be assigned.
4.2 K-nearest neighbor algorithm:
One study added a cost function to a Naive Bayes filter which assigned a high cost
to false positives. This translates into a high spam classification threshold, and
a higher threshold results in higher spam precision. Another proposed using a
nearest neighbor algorithm as part of a multi-filter approach. After black and
white listing, a message is classified by a filter using rough sets, which give
approximate descriptions of concepts. If this filter classifies the message as
spam, it is then passed to the k-Nearest Neighbor classifier.
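The link between a false-positive cost and a classification threshold can be made concrete. If a false positive is taken to cost λ times as much as a false negative, labelling a message as spam has the lower expected cost only when P(spam) > λ / (1 + λ). The sketch below applies that rule to a tiny multinomial Naive Bayes with add-one smoothing; it is a generic illustration under these assumptions, not the cited filter, and all names and the default cost of 9 are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(messages, labels):
    """Count words per class; labels use 1 for spam and 0 for legitimate."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter(labels)
    for text, label in zip(messages, labels):
        counts[label].update(text.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def spam_probability(text, model):
    """P(spam | text) under multinomial Naive Bayes with add-one smoothing."""
    counts, docs, vocab = model
    log_score = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        log_score[label] = score
    top = max(log_score.values())
    ham, spam = (math.exp(log_score[k] - top) for k in (0, 1))
    return spam / (ham + spam)


def spam_threshold(false_positive_cost):
    """Threshold on P(spam) when a false positive costs this many false negatives."""
    return false_positive_cost / (1.0 + false_positive_cost)


def classify(text, model, false_positive_cost=9.0):
    """True if the message is labelled spam under the given cost ratio."""
    return spam_probability(text, model) > spam_threshold(false_positive_cost)
```

Raising the cost raises the threshold, so fewer messages are labelled spam and those that are labelled are more likely to be spam, which is the precision effect described above.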
4.2.1 Dual Filtering method:
An evaluation on a dataset of 550 spam and 200 non-spam SMS messages with k = 12
showed that this dual filtering method is faster and more accurate than using the
k-NN classifier alone. Other work proposed a simple index model which computes the
spam score of each message as a function of the frequency of occurrence of features
across the different categories of spam and non-spam SMS in the training data. The
approach uses inverted indexes for access speed and ease of update, and the authors
suggested that an ensemble of these index models, each based on a different feature
set derived from lexical analysis of the content and the header information, gives
good filtering performance.
4.2.2 Evolutionary classifiers:

Junaid and Farooq investigated the use of evolutionary classifiers for filtering
SMS spam. They compared supervised learning algorithms with four evolutionary
classifiers. The results show comparable performance overall, but the supervised
Classifier System (UCS), a Michigan-style rule-based learning classifier system,
outperformed the others when more than 3000 messages were presented for filtering.
4.3 UCS-supervised Classifier System:

The authors claim this is due to the capability of UCS to evolve rules online. A
drawback of such classifiers is their performance at run time. The work reported
that the average time to classify a message using the supervised learning
algorithms is a fraction of a second, while most of the evolutionary classifiers
require three to four seconds for classification, although UCS is faster at 1.2
seconds.
4.3.1 Comparison of supervised learning algorithms:
Most recently, a comparison of a number of learning algorithms has been reported,
with the aim of providing baseline results for algorithm comparison. The authors
use a corpus of 5,574 messages, of which 747 are spam and 4,827 are non-spam SMS
messages. Two methods of tokenization are tested: a tokenizer which separates on
any character other than alphanumeric and certain punctuation characters, and a
variation which also tokenizes domain names and mail addresses.
4.3.2 The thirteen classifiers:
The 13 classifiers used in the experiment include 8 variations of Naive Bayes, a
linear SVM, a Minimum Description Length classifier, k-Nearest Neighbor, the
decision tree learner C4.5, and PART, a rule learner. The linear SVM with
alphanumeric tokenization performs best, with an accuracy of 97.64%, a false
positive rate of 0.18%, and a recall on the spam class of 83.1%.
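The three measures quoted above are all derived from the same confusion counts, and seeing them side by side shows why accuracy alone can look flattering on an imbalanced corpus. The helper below is a generic sketch; the function name and the example counts used in the note afterwards are hypothetical and are not the counts behind the reported figures.

```python
def evaluate(tp, fp, tn, fn):
    """Standard measures from confusion counts, with spam as the positive class.

    tp: spam correctly blocked       fp: legitimate messages wrongly blocked
    tn: legitimate correctly passed  fn: spam wrongly passed
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),
        "spam_recall": tp / (tp + fn),
        "spam_precision": tp / (tp + fp),
    }
```

For instance, with made-up counts of 80 spam caught, 20 spam missed, 1 legitimate message blocked and 899 passed, accuracy is 0.979 even though a fifth of the spam gets through, because legitimate messages dominate the total.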
4.3.3 Three top-ranked algorithms:
The next three best-ranked algorithms, namely boosted NB, boosted C4.5 and PART,
were not significantly worse, each with an accuracy of 97.5%. Hybrid approaches
have also been proposed which combine content-based filtering with
challenge-response, a technique which automatically sends a reply to a message
sender requiring the sender to perform some action to ensure delivery of their
message.
4.4 Challenge response system:
Challenge-response systems have been put forward for spam filtering, but there is
considerable evidence criticizing them. Their limitations include increased network
traffic, problems with receiving messages from legitimate automated online
services such as mailing lists, and the fact that they are open to abuse.
4.4.1 Limitations to the challenge response system:
One proposal to address some of these limitations was to use challenge-response
only for the limited subset of SMS messages where the content-based classifier is
uncertain. The filter is operated centrally, and the authors identify a number of
protocols of use involving an image CAPTCHA to ensure delivery. The suggestion was
that an image CAPTCHA should be generated for each such message, which may then be
centrally black- or white-listed.
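Restricting the challenge to uncertain messages amounts to a three-way routing decision on the classifier's score. The sketch below assumes the content-based classifier outputs a spam probability between 0 and 1; the function name and the 0.2 and 0.8 cut-offs are hypothetical values chosen for illustration.

```python
def route_message(spam_probability, lower=0.2, upper=0.8):
    """Decide what to do with a message given the classifier's spam probability.

    Confident predictions are acted on directly; only the uncertain band
    between the two cut-offs triggers a challenge to the sender.
    """
    if spam_probability >= upper:
        return "block"
    if spam_probability <= lower:
        return "deliver"
    return "challenge"
```

Narrowing the band reduces the extra traffic caused by challenges, at the price of acting on less certain predictions.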
4.4.2 Solutions installed on client’s mobile device:

Much of the work discussed so far covers server-based and centrally operated
solutions to the spam filtering problem. However, a number of researchers have
suggested solutions which are installed on the client mobile, or which are part of
a distributed approach that combines central server processing with client-side
processing. Deng and Peng proposed a distributed filtering system using a Naive
Bayes classifier on the client mobile. User feedback on the client side is needed
to confirm spam classifications and misclassifications, and these are reported back
to the SMS processing center via a short code.
4.5 Feature engineering in SMS spam:
The success of machine learning techniques depends greatly on the selection of an
appropriate feature set for the problem. There has been work in feature engineering
for spam which attempts to identify the best features to use in the SMS message
representation.
4.5.1 Feature set:
A feature set including words, normalized words, character bi- and tri-grams and
word bi-grams, suggested in earlier work, has provided a base feature set for much
of the work in feature engineering. A later study found that a slight variation on
this, which adds orthogonal sparse word bigrams, improved the performance of
classification algorithms on SMS spam data.
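A feature extractor covering the feature types named above (words, normalized words, character bi- and tri-grams, word bi-grams, and orthogonal sparse bigrams) might look like the following sketch. The prefixes, the function names and the default `window` of 3 are assumptions for illustration; an orthogonal sparse bigram is taken here to be a pair of words within the window together with the number of words skipped between them.

```python
def char_ngrams(text, n):
    """All contiguous character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_bigrams(tokens):
    """Adjacent word pairs."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def orthogonal_sparse_bigrams(tokens, window=3):
    """Word pairs up to window-1 positions apart, tagged with the skip count."""
    features = []
    for i, first in enumerate(tokens):
        for gap in range(1, window):
            j = i + gap
            if j < len(tokens):
                features.append(f"{first} <skip:{gap - 1}> {tokens[j]}")
    return features


def extract_features(message):
    """Prefixed feature strings so the different feature types never collide."""
    words = message.split()
    lowered = [w.lower() for w in words]
    text = message.lower()
    return (["w:" + w for w in words]
            + ["lw:" + w for w in lowered]
            + ["c2:" + g for g in char_ngrams(text, 2)]
            + ["c3:" + g for g in char_ngrams(text, 3)]
            + ["wb:" + b for b in word_bigrams(lowered)]
            + ["osb:" + b for b in orthogonal_sparse_bigrams(lowered)])
```

The skip tag lets "free ... now" count as a feature even when a spammer inserts a filler word between the two.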
4.5.2 Stylistic features:
A further study extended the base feature set by including features based on
stylometry, as suggested in authorship attribution studies. The stylistic features,
extracted using shallow linguistic analysis, included the byte length and the
average byte length of messages, part-of-speech n-grams, and emoticon and special
character frequencies, which were extracted using manually constructed lexicons.
These stylistic features, tested on a Korean spam dataset, were shown to be
potentially useful in improving the performance of a maximum entropy based spam
filter.
CHAPTER 5
CONCLUSION
This report has presented the state of the art in SMS spam filtering and has
reviewed a number of approaches to the problem which have been suggested and
tested. Using different datasets, various researchers have shown that supervised
learning algorithms can be effective for spam classification, with reported
accuracies of up to 98%. There is also some evidence of the use of non-content
approaches, such as social network analysis and the identification of patterns of
message submission.
5.1 Multilingual environments:
Mobile networks are language independent and, especially in multi-language regions,
they handle messages in a mix of languages. However, all the spam research up to
now has used single-language datasets, and the use of word tokenization introduces
an inherent limitation on the portability of filtering solutions. Because of the
short length of the SMS message content and the lack of identifying information in
the headers, it may be difficult to identify the language of each particular
message. Robustness in multilingual environments will be a key requirement of
deployed spam filters.
5.2 Hybrid solutions:
With spam filtering there is no single solution that works. It is likely that some
types of spam can be better filtered by particular techniques, so, as in the email
domain, we see hybrid solutions as the way forward. Given that SMS filtering must
happen under very strict processing time constraints, content-based and
collaborative filters could usefully be augmented with simple and less
resource-intensive filtering techniques such as blacklisting.
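A hybrid arrangement of this kind can be sketched as a short pipeline in which cheap sender list lookups run first and the more expensive content-based score is computed only when the lists give no answer. The function name, the whitelist step, the `content_score` callable and the 0.9 threshold are all assumptions for illustration.

```python
def hybrid_filter(sender, message, blacklist, whitelist, content_score,
                  threshold=0.9):
    """Label a message 'spam' or 'ham' using lists first, content second.

    content_score is any callable returning a spam probability for the
    message text; it is only invoked for senders on neither list.
    """
    if sender in blacklist:
        return "spam"
    if sender in whitelist:
        return "ham"
    return "spam" if content_score(message) >= threshold else "ham"
```

Because set membership tests are constant-time, the list stage adds almost nothing to the processing budget while sparing the classifier a share of the traffic.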
5.3 Advanced address-based Filtering:
The recent move in spam filtering has been towards advanced address-based filtering
approaches, including social network analysis. These techniques should also be
considered in the mobile domain, although the lack of adequate data will hamper
such research.
5.4 Scalability and real-world deployment:
The work reviewed here represents research prototypes and solutions evaluated under
laboratory conditions. It shows that a filter based on support vector machines may
be an effective solution. For a real-world deployment, however, the issues of scale
and robustness become significant, and fast databases, clustering, and efficient
data structures will be required.
5.5 Industry collaboration:
Advances made should be validated by real-world trials, which only the network
operators can facilitate. As the volume of spam increases, the promise of
content-based filtering should prove attractive to industry. We also see a number
of encouraging points in the current state of research and much potential for
further advances. Overall, there are a number of competing technologies and many
contending solutions, and the best solution may well turn out to be a combination
of these.
Appendix 1 (Screenshots):
Figure 1.1: Sublime IDE-I
Figure 1.2: Sublime IDE-II
Figure 1.3: Sublime IDE-III
Figure 1.4: Python IDE-I
Figure 1.5: Python IDE-II
Appendix 2 (Source code):
import pprint
import string
from collections import Counter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Load the SMS Spam Collection (tab-separated: label, message text)
df = pd.read_table('SMSSpamCollection',
                   sep='\t',
                   header=None,
                   names=['label', 'sms_message'])
print("Dataframe head before modifications\n", df.head(), "\n")

# Encode the labels numerically: ham -> 0, spam -> 1
df['label'] = df.label.map({'ham': 0, 'spam': 1})
print("The size of the array: ", df.shape)
print("Dataframe head after modifications\n", df.head(), "\n")

# Bag of words built by hand on a small example
# 1. Convert all strings to their lower case form
documents = ["Hello, how are you!",
             "Win money win from home.",
             "Call me now.",
             "Hello, Call hello you tomorrow?!!!"]
lower_case_documents = []
for i in documents:
    lower_case_documents.append(i.lower())
# print("Printing lower case documents: \n", lower_case_documents)

# 2. Removing all punctuations
sans_punctuation_documents = []
for i in lower_case_documents:
    sans_punctuation_documents.append(
        i.translate(str.maketrans('', '', string.punctuation)))
# print("Printing sans punctuation documents: \n", sans_punctuation_documents)

# 3. Tokenization
preprocessed_documents = []
for i in sans_punctuation_documents:
    preprocessed_documents.append(i.split(' '))
print("Printing preprocessed documents: \n", preprocessed_documents)

# 4. Count frequencies
frequency_list = []
for i in preprocessed_documents:
    frequency_counts = Counter(i)
    frequency_list.append(frequency_counts)
# pprint.pprint(frequency_list)

# The same bag of words using scikit-learn's CountVectorizer
count_vector = CountVectorizer()
count_vector.fit(documents)
# get_feature_names_out() replaces get_feature_names(), which was removed
# in scikit-learn 1.2
feature_names = count_vector.get_feature_names_out()
print(feature_names)
doc_array = count_vector.transform(documents).toarray()
print("Doc array: \n", doc_array)
frequency_matrix = pd.DataFrame(doc_array, columns=feature_names)
print("\nFrequency matrix: \n", frequency_matrix)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'],
                                                    df['label'],
                                                    random_state=1)
print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

# The labels were already mapped to 0/1 above; mapping them a second time
# would turn every label into NaN, so the frame is only inspected here.
print(df.shape)
print(df.head())
References:
1. E. B. Cleff, ‘‘Privacy issues in mobile advertising,’’ Int. Rev. Law Comput. Technol.,
vol. 21, no. 3, pp. 225–236, 2007.
2. A. Lambert, ‘‘Analysis of SPAM,’’ M.S. thesis, Dept. Comput. Sci., Univ. Dublin,
Trinity College, Republic of Ireland, 2003, pp. 1–100.
3. C. Wang et al., ‘‘A behavior-based SMS antispam system,’’ IBM J. Res.

Develop., vol. 54, no. 6, pp. 3:1–3:16, Nov./Dec. 2010.
4. B. Reaves, N. Scaife, D. Tian, L. Blue, P. Traynor, and K. R. Butler, ‘‘Sending

out an SMS: Characterizing the security of the SMS ecosystem with public
gateways,’’ in Proc. IEEE Symp. Secur. Privacy (SP), May 2016, pp. 339–356.
5. T. Yamakami, ‘‘Impact from mobile SPAM mail on mobile internet services,’’ in

Parallel and Distributed Processing and Applications. Berlin, Germany: Springer,
2003, pp. 179–184.
6. G. Camponovo and D. Cerutti, ‘‘The spam issue in mobile business: A

comparative regulatory overview,’’ in Proc. 3rd Int. Conf. Mobile Bus., New York,
NY, USA, 2004, pp. 1–17.
7. J. Fu, P. Lin, and S. Lee, ‘‘Detecting spamming activities in a campus network

using incremental learning,’’ J. Netw. Comput. Appl., vol. 43, pp. 56–65, Aug.
2014.
8. J. Hua and Z. Huaxiang, ‘‘Analysis on the content features and their correlation
of Web pages for spam detection,’’ China Commun., vol. 12, no. 3, pp. 84–94.
9. P. P. K. Chan, C. Yang, D. S. Yeung, and W. W. Ng, ‘‘Spam filtering for short

messages in adversarial environment,’’ Neurocomputing, vol. 155, pp. 167–176,
May 2015.
10. S.-E.Kim,J.-T.Jo,andS.-H.Choi,‘‘SMSspamfilterinigusingkeyword frequency

ratio,’’ Int. J. Secur. Appl., vol. 9, no. 1, pp. 329–336, 2015.
11. O. Osho, O. Y. Ogunleke, and A. A. Falaye, ‘‘Frameworks for

mitigatingidentitytheftandspammingthroughbulkmessaging,’’inProc.IEEE6th Int.
Conf. Adapt. Sci. Technol. (ICAST), Oct. 2014, pp. 1–6.
12. R. Islam and J. Abawajy, ‘‘A multi-tier phishing detection and filtering
28
approach,’’J.Netw.Comput.Appl.,vol.36,no.1,pp.324–335,Jan.2013.
13. O. Osho, V. L. Yisa, O. Y. Ogunleke, and S. I. M. Abdulhamid, ‘‘Mobile

spamming in Nigeria: An empirical survey,’’ in Proc. Int. Conf. Cybersp. (CYBER-
Abuja), Nov. 2015, pp. 150–159.
14. S. J. Delany, M. Buckley, and D. Greene, ‘‘SMS spam filtering: Methods

anddata,’’ExpertSyst.Appl.,vol.39,no.10,pp.9899–9908,Aug.2012.
15. A.BantukulandP.J.Marsico,‘‘Methods,systems,andcomputerprogram products

for short message service (SMS) spam filtering using E-mail spam filtering
resources,’’ U.S. Patent 7751836 B2, Jul. 6, 2010.
16. H.-Y. Chou and N.-H. Lien, ‘‘Effects of SMS teaser ads on product
curiosity,’’Int.J.MobileCommun.,vol.12,no.4,pp.328–345,Jul.2014.
17. N. Jindal and B. Liu, ‘‘Review spam detection,’’ in Proc. 16th Int. Conf. World
Wide Web, 2007, pp. 1189–1190.
18. M. Jiang, P. Cui, and C. Faloutsos, ‘‘Suspicious behavior detection: Current

trends and future directions,’’ IEEE Intell. Syst., vol. 31, no. 1, pp. 31–39, Jan./Feb.
2016.
19. A. Liberati et al., ‘‘The PRISMA statement for reporting systematic reviews and
meta-analyses of studies that evaluate healthcare interventions: Explanation and
elaboration,’’ Ann. Internal Med., vol. 151, pp. W-65–W-94, Jun. 2009.
20. L. Hartling, C. Spooner, L. Tjosvold, and A. Oswald, ‘‘Problem-based

learninginpre-clinicalmedicaleducation:22yearsofoutcomeresearch,’’ Med. Teacher,
vol. 32, no. 1, pp. 28–35, 2010.
21. A. Karami, A. Amir, and L. Zhou, ‘‘Improving static SMS spam detection by
using new content-based features,’’ in Proc. AISeL, 2014, pp. 1–9.
22. A. A. Al-Hasan and E.-S. M. El-Alfy, ‘‘Dendritic cell algorithm for mobile
phone spam filtering,’’ Procedia Comput. Sci., vol. 52, no. 1, pp. 244–251, 2015.
29

