
SPAM Detection

SPAM
• Originally the name of Hormel's canned meat, "spam" now also
refers to junk e-mail or irrelevant postings to a newsgroup or
bulletin board.
• The unsolicited e-mail messages you receive about refinancing
your home, reversing aging, and losing those extra pounds are
all considered to be spam.
• Spamming other people is definitely not cool and is one of the
most notorious violations of Internet etiquette (or
"netiquette").
• So if you ever get the urge to let thousands of people know
about that hot new guaranteed way to make money on the
Internet, please reconsider.
One Solution to Spam Detection
• Machine Learning
– Learn to tell spam from good mail (ham)

• Naïve Bayes

Advantages of Bayesian Method
• The Bayesian approach is self-adapting: it keeps learning from
new spam.
• It takes the whole message into account.
• It is easy to use and very accurate (claimed accuracy: 97%).
• It is multilingual.
• It reduces the number of false positives.

A Spam Filter
• Naïve Bayes spam filter
• Data:
– Collection of emails, labeled spam or ham
– Note: someone has to hand label all this data!
– Split into training, testing sets
• Classifiers
– Learn on the training set
– Test it on new emails

Example emails:
"Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"
"TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99"
"Ok, I know this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."
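The data step on this slide (hand-labeled emails split into training and testing sets) could be sketched as follows. This is only an illustration: the function name `split_labeled_emails`, the 25% test fraction, the fixed seed, and the placeholder emails are all assumptions, not part of the original slides.

```python
import random


def split_labeled_emails(emails, test_fraction=0.25, seed=0):
    """Shuffle (text, label) pairs and split them into training and test lists."""
    shuffled = list(emails)                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder data: in practice each pair is a real email with a hand-assigned label.
labeled_emails = [(f"email text {i}", "spam" if i % 3 == 0 else "ham") for i in range(12)]

train_set, test_set = split_labeled_emails(labeled_emails)
print(len(train_set), "training emails,", len(test_set), "test emails")
```

The classifier is then fitted on `train_set` only, and `test_set` stands in for the "new emails" it is evaluated on.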
Later in time

Coming before or earlier


Discrete example
Separate spam from valid email, attributes=words
• D1: “send us your password” spam
• D2: “send us your review” ham
• D3: “review password” ham
• D4: “review us” spam
• D5: “send your password” spam
• D6: “send us your account” spam
Construct Vocabulary

Word       P(word | spam)   P(word | ham)
password   2/4              1/2
review     1/4              2/2
send       3/4              1/2
us         3/4              1/2
your       3/4              1/2
account    1/4              0/2
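The table can be reproduced by counting, for each class, the fraction of emails that contain each word. The sketch below assumes D3 is “review password”, the reading that matches the table's 1/2 for "your" in ham; the function name `word_probabilities` is illustrative.

```python
from fractions import Fraction

# The six labeled emails D1..D6 (D3 taken as "review password").
training = [
    ("send us your password", "spam"),
    ("send us your review", "ham"),
    ("review password", "ham"),
    ("review us", "spam"),
    ("send your password", "spam"),
    ("send us your account", "spam"),
]


def word_probabilities(docs):
    """Return {(word, label): fraction of emails with that label containing the word}."""
    n_docs = {}        # emails per class
    n_with_word = {}   # emails per class that contain a given word
    vocab = set()
    for text, label in docs:
        n_docs[label] = n_docs.get(label, 0) + 1
        for word in set(text.split()):          # count presence, not repetitions
            vocab.add(word)
            n_with_word[(word, label)] = n_with_word.get((word, label), 0) + 1
    return {
        (word, label): Fraction(n_with_word.get((word, label), 0), n_docs[label])
        for word in vocab
        for label in n_docs
    }


probs = word_probabilities(training)
for word in sorted({w for w, _ in probs}):
    print(f"{word:<10} spam={probs[(word, 'spam')]}  ham={probs[(word, 'ham')]}")
```

`Fraction` prints values in lowest terms, so 2/4 appears as 1/2 and 2/2 as 1.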

Class priors (4 of the 6 labeled emails are spam, 2 are ham):
P(spam) = 4/6
P(ham) = 2/6
Naïve Bayes
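• Want P(spam | words)
• Use Bayes Rule:

$$P(\text{spam} \mid \text{words}) = \frac{P(\text{words} \mid \text{spam})\, P(\text{spam})}{P(\text{words})}$$

$$P(\text{words}) = P(\text{words} \mid \text{spam})\, P(\text{spam}) + P(\text{words} \mid \text{ham})\, P(\text{ham})$$

• Assume independence: given the class, the probability of each word is independent of the others

$$P(\text{words} \mid \text{spam}) = P(\text{word}_1 \mid \text{spam})\, P(\text{word}_2 \mid \text{spam}) \cdots P(\text{word}_n \mid \text{spam})$$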
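A minimal sketch of these three formulas in code, assuming each vocabulary word is treated as a present/absent attribute (a present word contributes P(word | class), an absent one contributes 1 - P(word | class)). The function name `naive_bayes_posterior` and the numbers below are hypothetical, chosen only to exercise the function.

```python
def naive_bayes_posterior(present_words, priors, word_given_class):
    """Return P(class | words) for every class under the independence assumption."""
    joint = {}
    for label, prior in priors.items():
        likelihood = 1.0
        for word, p in word_given_class[label].items():
            # Independence: multiply one factor per vocabulary word.
            likelihood *= p if word in present_words else 1.0 - p
        joint[label] = likelihood * prior        # P(words | class) * P(class)
    evidence = sum(joint.values())               # P(words), summed over classes
    return {label: value / evidence for label, value in joint.items()}


# Hypothetical priors and word probabilities (not taken from the slides).
priors = {"spam": 0.5, "ham": 0.5}
word_given_class = {
    "spam": {"offer": 0.8, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.6},
}

posterior = naive_bayes_posterior({"offer"}, priors, word_given_class)
print(posterior)
```

The sketch does not guard against P(words) being zero, which happens when every class assigns the message zero likelihood; practical filters usually smooth the word probabilities to avoid that.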

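New email: “review us now”

P(spam) = 4/6
P(ham) = 2/6

Represent the email as a present/absent vector over the vocabulary (password, review, send, us, your, account). "now" is not in the vocabulary, so it is ignored.

$$P(\text{review us} \mid \text{spam}) = P(0,1,0,1,0,0 \mid \text{spam}) = (1 - 2/4)(1/4)(1 - 3/4)(3/4)(1 - 3/4)(1 - 1/4) \approx 0.0044$$

$$P(\text{review us} \mid \text{ham}) = P(0,1,0,1,0,0 \mid \text{ham}) = (1 - 1/2)(2/2)(1 - 1/2)(1/2)(1 - 1/2)(1 - 0/2) = 0.0625$$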
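$$P(\text{ham} \mid \text{words}) = \frac{P(\text{words} \mid \text{ham})\, P(\text{ham})}{P(\text{words})}$$

$$P(\text{ham} \mid \text{review us}) = \frac{0.0625 \times 2/6}{0.0625 \times 2/6 + 0.0044 \times 4/6} \approx 0.88$$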

Is it correct?
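The arithmetic of this worked example can be checked with exact fractions. The sketch below hard-codes the vocabulary table and priors from the slides; the variable and function names are illustrative, and "now" is dropped because it is not in the vocabulary.

```python
from fractions import Fraction as F

vocab = ["password", "review", "send", "us", "your", "account"]
# P(word | class), in the same order as vocab.
p_word = {
    "spam": [F(2, 4), F(1, 4), F(3, 4), F(3, 4), F(3, 4), F(1, 4)],
    "ham": [F(1, 2), F(2, 2), F(1, 2), F(1, 2), F(1, 2), F(0, 2)],
}
prior = {"spam": F(4, 6), "ham": F(2, 6)}

message = "review us now"
present = [word in message.split() for word in vocab]   # presence vector 0,1,0,1,0,0


def likelihood(label):
    """P(words | label): p for each present word, 1 - p for each absent word."""
    result = F(1)
    for p, is_present in zip(p_word[label], present):
        result *= p if is_present else 1 - p
    return result


lik = {label: likelihood(label) for label in prior}
evidence = sum(lik[label] * prior[label] for label in prior)   # P(words)
posterior_ham = lik["ham"] * prior["ham"] / evidence

print("P(words | spam) =", lik["spam"], "=", float(lik["spam"]))
print("P(words | ham)  =", lik["ham"], "=", float(lik["ham"]))
print("P(ham | words)  =", posterior_ham, "=", float(posterior_ham))
```

In exact terms the posterior is 64/73, so the email is classified as ham even though the training email “review us” (D4) is labeled spam.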
