International Journal of Engineering Trends and Technology (IJETT) Volume 5 number 1 - Nov 2013

ISSN: 2231-5381 Page 36

An Efficient Model Of Detection And Filtering Technique Over Malicious And
Spam E-Mails
V S Kumar, Ravi Kumar
Final MTech Student, Assoc. Professor & Head of the Dept
Dept of CSE, Kakinada Institute of Engineering & Technology, Kakinada


Abstract:

Malicious mail detection and filtering is an active
research area. In this paper we propose an efficient
malicious mail detection system based on a supervised
learning (classification) approach. The proposed
approach handles spam and phishing emails using the
metadata characteristics of each email. The filtering
and detection techniques presented here perform well at
detecting targeted malicious mail. A further step
detects persistent and recipient-oriented attacks.


Introduction:

Targeted email attacks that enable computer network
exploitation have become more prevalent, more
insidious, and more widely documented in recent years.
Beyond nuisance spam or phishing designed to trick
users into revealing personal information, targeted
malicious email (TME) facilitates computer network
exploitation and the gathering of sensitive information
from targeted networks. These targeted email attacks are
not singular, unrelated events; they are coordinated and
persistent attack campaigns that can span years. This
paper surveys and categorizes existing email filtering
techniques, proposes and implements new methods for
detecting targeted malicious email, and compares these
newly developed techniques with traditional detection
methods. Current research and commercial methods for
detecting illegitimate email are limited to addressing
Internet-scale email abuse, such as spam, and do not
focus on targeted malicious email. Furthermore,
conventional tools such as anti-virus software are
vulnerability focused, examining only the binary code of
an email while ignoring all relevant contextual metadata.

Related work:

Classification is the process of grouping together
documents or data that have similar properties or are
related. Once data and documents are classified, they
become easier to understand, and we can infer logic from
the classification. Above all, classification makes new
data easier to sort and makes retrieval faster, with
better results.

Dewey Decimal Classification is the system most widely
used in libraries. It is hierarchical: there are ten
parent classes, each divided into ten divisions, which
are in turn divided into ten sections. Each book is
assigned a number according to its class, division and
section alphabetically. Dewey Decimal Classification is
very successful in libraries, but unfortunately it
cannot be applied to information retrieval on the web.
It would require a central catalogue of all documents on
the web: whenever a new document was added, a central
committee would have to examine it, classify it, assign
it a number and publish it on the web. This strongly
violates the way the internet works. An authority
controlling the contents of the web would restrict the
amount of data that can be added to it. We need a web
that allows everyone to upload their content, together
with a machine learning technique that finds new data
and classifies it as it arrives.

Confidentiality issues in data mining. A key problem that
arises in any en masse collection of data is that of
confidentiality. The need for privacy is sometimes due to
law (e.g., for medical databases) or can be motivated by
business interests. However, there are situations where
the sharing of data can lead to mutual gain. A key utility
of large databases today is research, whether it be
scientific, or economic and market oriented. Thus, for
example, the medical field has much to gain by pooling
data for research; as can even competing businesses with
mutual interests. Despite the potential gain, this is often
not possible due to the confidentiality issues which arise.
We address this question and show that highly efficient
solutions are possible. Our scenario is the following: Let
P1 and P2 be parties owning (large) private databases D1
and D2. The parties wish to apply a data-mining
algorithm to the joint databases D1, D2 without
revealing any unnecessary information about their
individual databases. That is, the only information
learned by P1 about D2 is that which can be learned
from the output of the data mining algorithm, and vice
versa. We do not assume any trusted third party who
computes the joint output.

Bayesian spam filtering: This is a statistical technique
for email filtering [18]. During filtering, it uses a
naive Bayes classifier [11] over words and other
features to identify spam e-mail [19], an approach
commonly used in text classification. Naive Bayes
classifiers [17] work by correlating the use of tokens
(typically words, or sometimes other things) with
spam [20] and non-spam e-mails and then using Bayesian
inference to calculate a probability that an email is or is
not spam.
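
The inference step can be written out for a single token. The notation here is an assumption introduced for illustration and does not appear in the original text: S denotes spam, H denotes legitimate mail, and w denotes a token. Under those assumptions, Bayes' theorem gives the probability that an email containing w is spam:

```latex
P(S \mid w) = \frac{P(w \mid S)\,P(S)}{P(w \mid S)\,P(S) + P(w \mid H)\,P(H)}
```

In practice P(w | S) and P(w | H) would be estimated from the share of spam and legitimate training emails that contain w, and P(S) and P(H) from the overall share of each class.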

Naive Bayes spam filtering: Naive Bayes spam filtering
is a technique that can tailor itself to the email needs
of individual users and gives false positive rates low
enough to be generally acceptable to users. Particular
words have particular probabilities of occurring in spam
email and in genuine email. For instance, most email
users will frequently encounter the word "viagra" in
spam email but will rarely see it in other email. The
filter does not know these probabilities in advance, so
it must be trained to build them up. To train the
filter, the user must manually indicate whether a new
email is spam or not. For all words in each training
email [9], the filter adjusts the probabilities in its
database that each word will appear in spam or
legitimate email. After training, the word probabilities
are used to compute the probability that an email
containing a particular set of words belongs to either
category. Each word in the email, or only the most
interesting words, contributes to the email's spam
probability. This contribution is called the posterior
probability and is computed using Bayes' theorem [7].
The email's spam probability is then computed over all
words in the email, and if the total exceeds a certain
threshold (say 95%), the filter marks the email as spam.
As in any other spam filtering technique, email marked
as spam can then be automatically moved to a "Junk" [20]
email folder, or even deleted outright. Some software
implements quarantine mechanisms that define a time
frame during

Proposed work:

We propose an efficient malicious email detection and
filtering technique. The testing dataset, which contains
the metadata of newly received emails, is classified
against the training dataset, which holds the metadata
of previously received unauthorized spam and malicious
mail. The attribute values of the testing dataset are
analyzed with respect to the probabilities of the same
attributes in the training dataset.

A dataset is a collection of tuples over a set of
attributes, with the possible values for each attribute.
It is supplied to the classification process so that the
behavior of the testing set can be analyzed with a
machine learning approach. A synthetic dataset can be
gathered for the classification results. The metadata of
previously received unauthorized malicious and spam
emails is maintained at the server end and is used to
classify the metadata of the testing dataset. Each
access made over the network is also recorded in the
training dataset, because this data helps in future
classification. To target persistent threats, we
introduce a metadata environment. This metadata
structure contains fields related to recipients, such as
email address, subject line and attached files, and
verification is performed using these fields. These
conditions are used to control Internet-wide attacks
arriving from different locations.

The dataset contains the features of the different
attacks. The training process starts from this dataset
alone. Once training is complete, attack detection is
performed and the attack-related emails are classified.
Emails are classified as targeted malicious mail,
non-targeted malicious mail, or persistent and
recipient-oriented mail.

The dataset contains emails exchanged between customers
and a company. Of all the emails in the dataset, those
that fall under anti-spam are classed as non-targeted
emails, and spam mails are classed as targeted malicious
mails. Repeated intrusion attempts are identified as
persistent emails. When a sender repeatedly sends
content to a particular recipient, that recipient's
mails carry high reputation values; such
reputation-related mails are classed as
recipient-oriented mails.

Preprocessing is the process of extracting the necessary
information (meta information) from previously received
unauthorized, malicious mails at the receiver end, from
fields such as email address, subject line and attached
files. The testing dataset contains the metadata of new
emails. It is compared against the training dataset by
calculating the probabilities of the attributes in the
testing and training datasets.
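
The preprocessing step can be illustrated with Python's standard `email` package. This is only a sketch of one possible approach: the function names `extract_metadata` and `attribute_probability`, the choice of fields, and the dictionary layout are assumptions made for this example and are not taken from the paper.

```python
from email import message_from_string, policy
from email.utils import parseaddr


def extract_metadata(raw_email):
    """Parse one raw email and return the metadata fields named in the text."""
    msg = message_from_string(raw_email, policy=policy.default)
    # parseaddr splits "Name <addr>" into (name, addr).
    _, sender = parseaddr(str(msg.get("From", "")))
    names = [part.get_filename() for part in msg.iter_attachments()]
    return {
        "sender": sender.lower(),
        "subject": str(msg.get("Subject", "")),
        "attachments": [name for name in names if name],
    }


def attribute_probability(training_rows, field, value):
    """Fraction of training records whose `field` equals `value`."""
    if not training_rows:
        return 0.0
    matches = sum(1 for row in training_rows if row[field] == value)
    return matches / len(training_rows)
```

A new email's metadata could then be compared field by field with the stored malicious records, for example with `attribute_probability(training_rows, "sender", meta["sender"])`.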

1. In this study, trees grow to maximum size: k =
number of trees to create; m = number of random
features to select for node splitting; and d = maximum
depth of the trees.

2. Select k vectors $\Theta_1, \ldots, \Theta_k$ from the
training data such that vector $\Theta_k$ is chosen
independently of $\Theta_1, \ldots, \Theta_{k-1}$.

3. For each of the bootstrap samples, grow a tree $T_k$,
where each node splits using the best split from m
randomly selected features. The result is multiple tree
classifiers $T_k : h(x, \Theta_k)$, where x is an input
vector of unknown classification.

4. To classify x, process that feature vector down
each tree in the forest. Each tree will output a
classification, also known as a vote. If $C_k(x)$
represents the classification of the kth tree in the
forest, then the aggregate classification of the forest is
$C_{\text{forest}}(x) = \text{majority vote}\ \{C_k(x)\}_{k=1}^{K}$,
where K is the total number of trees (the k of step 1).
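
The four steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, features are assumed to be numeric, class labels are assumed to be strings, and the "best split" is chosen by misclassification count rather than by any criterion named in the text.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label (the majority vote)."""
    return Counter(labels).most_common(1)[0][0]


def split_errors(rows, labels, f, t):
    """Misclassified count if each side of the split predicts its majority."""
    errors = 0
    for side in ([y for r, y in zip(rows, labels) if r[f] <= t],
                 [y for r, y in zip(rows, labels) if r[f] > t]):
        if not side:
            return None  # degenerate split: one side is empty
        errors += len(side) - Counter(side).most_common(1)[0][1]
    return errors


def grow_tree(rows, labels, m, d, rng):
    """Grow one tree; every node picks its best split among m random features."""
    if d == 0 or len(set(labels)) == 1:
        return majority(labels)  # leaf node holds a class label
    best = None
    for f in rng.sample(range(len(rows[0])), m):
        for t in sorted({r[f] for r in rows}):
            e = split_errors(rows, labels, f, t)
            if e is not None and (best is None or e < best[0]):
                best = (e, f, t)
    if best is None:
        return majority(labels)
    _, f, t = best
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, d - 1, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, d - 1, rng))


def tree_vote(tree, x):
    """Pass feature vector x down one tree and return its vote."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree


def grow_forest(rows, labels, k, m, d, seed=0):
    """Grow k trees, each on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, d, rng))
    return forest


def forest_classify(forest, x):
    """Aggregate classification: majority vote over all trees."""
    return majority([tree_vote(tree, x) for tree in forest])
```

A call such as `forest_classify(grow_forest(rows, labels, k=15, m=2, d=3), x)` would return "TME" or "NTME" when the training labels use those two strings.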

The 83 features extracted from each email are
represented as a feature vector. The output of the
random forest classifier for a particular email is
binary: the email is classified as either TME or
non-targeted malicious email (NTME), using the email's
specific vector of persistent threat and
recipient-oriented features as input. When the
classifier correctly predicts a TME, it is a true
positive (TP). When the classifier correctly predicts an
NTME, it is a true negative (TN). When the classifier
predicts an NTME as TME, it is a false positive (FP) or
Type I error. When the classifier predicts a TME as
NTME, it is a false negative (FN) or Type II error.
Table 3 shows the possible outcomes from the classifier.

The false positive rate (FPR) is the proportion of
NTME that was incorrectly classified as TME. The
specificity is equal to $1 - \mathrm{FPR}$, where the FPR is
$\mathrm{FPR} = \dfrac{FP}{FP + TN}$.

The false negative rate (FNR) is the proportion of
TME that was incorrectly classified as NTME. The
sensitivity is equal to $1 - \mathrm{FNR}$, where the FNR is
$\mathrm{FNR} = \dfrac{FN}{FN + TP}$.
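
The outcome counts and the rates derived from them can be computed as in the sketch below. The function names and the label strings "TME" and "NTME" are choices made for this example, and the rate formulas are the standard definitions implied by the wording above; the sketch assumes both classes occur in the evaluated data, so neither denominator is zero.

```python
def confusion_counts(actual, predicted, positive="TME"):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # TME correctly predicted as TME
            else:
                fp += 1  # NTME predicted as TME (Type I error)
        else:
            if a == positive:
                fn += 1  # TME predicted as NTME (Type II error)
            else:
                tn += 1  # NTME correctly predicted as NTME
    return tp, tn, fp, fn


def rates(tp, tn, fp, fn):
    """Return FPR, FNR, specificity and sensitivity from the four counts."""
    fpr = fp / (fp + tn)  # share of NTME wrongly flagged as TME
    fnr = fn / (fn + tp)  # share of TME missed as NTME
    return {
        "FPR": fpr,
        "FNR": fnr,
        "specificity": 1 - fpr,
        "sensitivity": 1 - fnr,
    }
```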

Conclusion and Future work:

We conclude our research work with an efficient
probability-based supervised learning approach that
classifies the testing dataset against the training
dataset. The approach can be enhanced with improved
classifiers, such as an improved naive Bayesian
classifier or the C4.5 algorithm, together with their
probability and classification measures.

References:

1. "Targeted Trojan Email Attacks," briefing 08/2005,
Nat'l Infrastructure Security Co-ordination Centre.
2. "Targeted Trojan Email Attacks," tech. cybersecurity
alert TA05-189A, US-CERT, 2005.
3. J.A. Lewis, "Holistic Approaches to Cybersecurity
to Enable Network Centric Operations," statement
before Armed Services Committee, Subcommittee on
Terrorism, Unconventional Threats and Capabilities,
110th Cong., 2nd sess., 1 April 2008.
4. 2009 Report to Congress of the U.S.-China
Economic and Security Review Commission, report,
Nov. 2009.
5. B. Krekel, Capability of the People's Republic of
China to Conduct Cyber Warfare and Computer
Network Exploitation, Oct. 2009.
6. I. Androutsopoulos et al., "An Experimental
Comparison of Naive Bayesian and Keyword-Based
Anti-Spam Filtering with Personal E-mail
Messages," Proc. 23rd Ann. Int'l ACM SIGIR Conf.
Research and Development in Information Retrieval,
ACM, 2000, pp. 160-167.
7. R.M. Amin, "Detecting Targeted Malicious Email
through Supervised Classification of Persistent
Threat and Recipient Oriented Features," PhD thesis,
Dept. Eng. and Applied Sciences, George
Washington Univ., 2011.
8. L. Breiman, "Random Forests," Machine Learning,
vol. 45, no. 1, 2001, pp. 5-32.
9. T. Hastie, R. Tibshirani, and J. Friedman, The
Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2nd ed., Springer, 2008.
10. E. Hutchins, M. Cloppert, and R. Amin,
"Intelligence-Driven Computer Network Defense
Informed by Analysis of Adversary Campaigns and
Intrusion Kill Chains," Proc. 6th Int'l Conf.
Information Warfare and Security (ICIW 11),
Academic Conferences, 2011, pp. 113-125.


Malireddi V S kumar
completed his BTech in Sri
Sai Aditya Institute Of
Science &
pursuing MTech in
Kakinada institute of Engineering &
Technology. His research interests are
data mining and network security.

Mr. K. Ravi Kumar is currently
working as Assoc.
Professor & Head of the
Department at Kakinada Institute of
Engineering & Technology. He completed his
BTech and MTech at Pragati Engg College,
Surampalem. His areas of interest are data mining,
bioinformatics and networking.