Natural Language Processing

Review-2
Prof. Saravanakumar K
Introduction
The problem we chose is spam mail filtering. Spam email is the massive
delivery of unwanted content or advertising campaigns without the consent of
the target users.
Problem Description
Spam email and popular social networks are used to distribute Trojan viruses,
a high number of advanced persistent threat (APT) phishing attacks, and other
risks for Internet users. This study classifies each email as spam or ham
(legitimate).
Paper-1
A new semantic-based feature selection method for spam filtering.

J. R. Méndez, T. R. Cotos-Yañez and D. Ruano-Ordás


Existing software
Existing products include SpamAssassin and Wirebrush4SPAM. These products
bring together a broad range of smart filtering techniques to filter spam
e-mails accurately.
Spam filtering techniques
Spam filtering techniques are often classified into different groups including:

● content-based filtering,
● collaborative schemes,
● domain authorization methods and
● characteristics-based filters
Content-based filtering
Content-based filtering schemes are especially important because the
decision-making is based exclusively on the content of the message.

ML methods for content-based filtering follow four steps:

(i) extract information,

(ii) discard noisy, inconsistent, redundant data,

(iii) represent each message as a vector of features,

(iv) use an ML approach to automatically classify messages.
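
The four steps can be sketched end to end in plain Python. This is only a minimal illustration, not the method of the paper: the stop-word list, the four training messages, and the choice of a multinomial Naive Bayes classifier with Laplace smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, used only for this illustration.
STOPWORDS = {"the", "a", "to", "and", "is", "for", "at", "you"}

def tokenize(text):
    # (i) extract information: lowercase word tokens
    return re.findall(r"[a-z']+", text.lower())

def clean(tokens):
    # (ii) discard noisy or redundant tokens (stop words, single letters)
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

def vectorize(tokens, vocabulary):
    # (iii) represent the message as a vector of term counts
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def train_naive_bayes(vectors, labels):
    # (iv) multinomial Naive Bayes with Laplace (add-one) smoothing
    model = {}
    n_terms = len(vectors[0])
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        term_totals = [sum(column) for column in zip(*rows)]
        total = sum(term_totals)
        model[label] = {
            "log_prior": math.log(len(rows) / len(labels)),
            "log_likelihood": [
                math.log((c + 1) / (total + n_terms)) for c in term_totals
            ],
        }
    return model

def classify(model, vector):
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            x * w for x, w in zip(vector, params["log_likelihood"])
        )
    return max(model, key=score)

# Toy training data (made up for the example).
training = [
    ("Win a free prize now", "spam"),
    ("Cheap pills, free offer", "spam"),
    ("Meeting agenda for Monday", "ham"),
    ("Lunch at noon on Monday", "ham"),
]
token_lists = [clean(tokenize(text)) for text, _ in training]
vocabulary = sorted({t for tokens in token_lists for t in tokens})
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]
labels = [label for _, label in training]
model = train_naive_bayes(vectors, labels)

def predict(text):
    return classify(model, vectorize(clean(tokenize(text)), vocabulary))

print(predict("free prize offer"))
print(predict("Monday meeting"))
```

Any classifier that accepts feature vectors could replace the Naive Bayes step; the point is that the decision depends only on the message content.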


Approach of Author
The authors introduced a feature selection (FS) technique for the
spam-filtering domain that takes advantage of semantic information to
classify messages.
Feature Selection
The goal is to unify the representation of messages that use different terms
but share the same subject matter.
Modelling of topic features
(i) Topic based Model

(ii) Semantic methods


Methodology
Workflow

(i) loading the corpus

(ii) email parsing process

(iii) email topic extractor and guesser

(iv) compute the topic-related significance of each feature.
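
Steps (i) and (ii) of the workflow can be sketched with Python's standard `email` package. The raw message, the addresses, and the helper names `parse_email` and `extract_terms` are hypothetical and only illustrate the idea; the tooling actually used in the paper is not described here.

```python
import re
from email import message_from_string
from email.policy import default

# Hypothetical raw message standing in for one corpus entry.
RAW_EMAIL = """\
From: offers@example.com
To: user@example.com
Subject: Cheap watches for sale

Buy a luxury watch today. Limited offer on every watch!
"""

def parse_email(raw):
    """Return (subject, body) for a raw message string."""
    msg = message_from_string(raw, policy=default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return str(msg["Subject"]), body

def extract_terms(text):
    """Split text into lowercase alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())

subject, body = parse_email(RAW_EMAIL)
terms = extract_terms(subject + " " + body)
print(subject)
print(terms)
```

The resulting term list is what a topic extractor would receive as input in step (iii).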


Guessing the topic
The authors used the WordNet lexical database. It groups words into synsets
(sets of synonyms), provides short definitions and usage examples for each
word, and defines different kinds of semantic relations between synsets
(nouns or verbs).

They focused on noun semantic relations.


Noun Semantic Relation
(i) hyponym (X is a kind of Y),

(ii) hypernym (X includes the notion of Y among others),

(iii) meronym (X is a part of Y),

(iv) holonym (X contains Y among others).
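
One plausible way to use the hypernym relation for topic guessing is to climb from each noun to a more general concept and count how often each concept is reached. The sketch below is an assumption-laden illustration: `TOY_HYPERNYMS` is a hand-made stand-in for WordNet, and the `levels` parameter and the function names are hypothetical, not taken from the paper.

```python
from collections import Counter

# Hypothetical toy hypernym map (word -> more general word).
# It is a stand-in for WordNet, not real WordNet data.
TOY_HYPERNYMS = {
    "viagra": "drug",
    "aspirin": "drug",
    "drug": "medicine",
    "mortgage": "loan",
    "loan": "finance",
    "medicine": None,
    "finance": None,
}

def topic_of(noun, levels=2):
    """Climb at most `levels` hypernym links and return the concept reached."""
    current = noun
    for _ in range(levels):
        parent = TOY_HYPERNYMS.get(current)
        if parent is None:
            break
        current = parent
    return current

def guess_topics(nouns):
    """Count the topics reached from the nouns known to the toy map."""
    return Counter(topic_of(n) for n in nouns if n in TOY_HYPERNYMS)

topics = guess_topics(["viagra", "aspirin", "mortgage", "hello"])
print(topics)
```

Here "viagra" and "aspirin" use different terms but land on the same topic, which is the unification that the feature selection approach aims for.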


Calculation

where occurrences(Ti) represents the number of times a specific topic (Ti)
appears in a message, and #TM denotes the total number of different topics
present in the e-mail.
Evaluation
The confusion matrix brings together the counts of each type of error and
hit:

(i) false positive errors (FP, legitimate messages classified as spam)

(ii) false negative errors (FN, undetected spam emails)

(iii) true positive hits (TP, number of spam messages detected)

(iv) true negative hits (TN, number of legitimate messages correctly classified)
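
The four counts can be computed directly from the true labels and the classifier output. This is a minimal sketch in which spam is treated as the positive class; the label strings and the sample lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, FP, FN and TN, treating `positive` as the spam class."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            # Flagged as spam: a hit if it really is spam, else a false alarm.
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # Let through: a miss if it was spam, else a correct ham decision.
            counts["FN" if truth == positive else "TN"] += 1
    return counts

# Hypothetical labels for six messages.
actual = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]
counts = confusion_counts(actual, predicted)
print(counts)
```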


Corpus
