SPAM
• The name comes from Hormel's canned meat, but "spam" now also
refers to junk e-mail or irrelevant postings to a newsgroup or
bulletin board.
• The unsolicited e-mail messages you receive about refinancing
your home, reversing aging, and losing those extra pounds are
all considered to be spam.
• Spamming other people is definitely not cool and is one of the
most notorious violations of Internet etiquette (or
"netiquette").
• So if you ever get the urge to let thousands of people know
about that hot new guaranteed way to make money on the
Internet, please reconsider.
One Solution to Spam Detection
• Machine Learning
– Learn to tell spam from good mail ("ham")
• Naïve Bayes
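For reference, the decision rule behind a naïve Bayes filter can be written as follows. The symbols are notation chosen here rather than taken from the slides: w_1, ..., w_n are the words of a message, and the "naïve" part is the assumption that words are conditionally independent given the class.

```latex
P(\text{spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}),
\qquad
P(\text{ham} \mid w_1, \dots, w_n) \;\propto\; P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})
```

A message is labeled spam when the first quantity is larger than the second.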
Advantages of Bayesian Method
• The Bayesian approach is self-adapting: it keeps learning from
new spam.
• It takes the whole message into account.
• It is easy to use and very accurate (claimed accuracy: 97%).
• It is multilingual.
• It reduces the number of false positives.
A Spam Filter
• Naïve Bayes spam filter
• Data:
– Collection of emails, labeled spam or ham
– Note: someone has to hand label all this data!
– Split into training, testing sets
• Example emails:
– "Dear Sir. First, I must solicit your confidence in this
transaction, this is by virture of its nature as being utterly
confidencial and top secret. …"
– "TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS
MESSAGE AND PUT "REMOVE" IN THE SUBJECT."
– "99 MILLION EMAIL ADDRESSES FOR ONLY $99"
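The data preparation on this slide (hand-labeled emails split into training and testing sets) might be sketched as below. This is a minimal illustration, not the course's code: the function name split_dataset, the 25% test fraction, the fixed seed, and the two ham messages are all assumptions made for the example, and the spam texts are abbreviated from the examples on the slide.

```python
import random

def split_dataset(labeled_emails, test_fraction=0.25, seed=0):
    """Shuffle (text, label) pairs and split them into (train, test) lists."""
    shuffled = list(labeled_emails)        # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Tiny hand-labeled corpus. The ham messages are hypothetical.
emails = [
    ("99 MILLION EMAIL ADDRESSES FOR ONLY $99", "spam"),
    ("TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY", "spam"),
    ("Lunch at noon tomorrow?", "ham"),
    ("Minutes from Monday's meeting attached", "ham"),
]

train, test = split_dataset(emails)
print(len(train), "training emails,", len(test), "testing emails")
```

Shuffling before the split keeps the testing set from being, for example, all spam just because the emails were collected in that order.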
Construct Vocabulary
Word      Spam  Ham
Password  2/4   1/2
Review    1/4   2/2
Send      3/4   1/2
Us        3/4   1/2
Your      3/4   1/2
Account   1/4   0/2
Is it correct?
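One way to see how such a table is used: the sketch below takes the counts from the table above (numerators are emails containing the word, denominators are the 4 spam and 2 ham emails) and scores a new message. Several choices here are assumptions, not something the slides specify: the priors are estimated from the class sizes (4/6 and 2/6), only words present in the message are multiplied in, unknown words are skipped, and add-one (Laplace) smoothing is applied so that the 0/2 entry for "Account" does not force the ham score to zero. The names class_score and posterior_spam are hypothetical.

```python
from fractions import Fraction

# Number of training emails per class, read from the table's denominators.
N_EMAILS = {"spam": 4, "ham": 2}

# Number of emails in each class that contain the word (table numerators).
WORD_COUNTS = {
    "password": {"spam": 2, "ham": 1},
    "review":   {"spam": 1, "ham": 2},
    "send":     {"spam": 3, "ham": 1},
    "us":       {"spam": 3, "ham": 1},
    "your":     {"spam": 3, "ham": 1},
    "account":  {"spam": 1, "ham": 0},
}

def class_score(words, label):
    """Prior times the smoothed P(word | label) for each known word."""
    score = Fraction(N_EMAILS[label], sum(N_EMAILS.values()))  # assumed prior
    for word in words:
        if word not in WORD_COUNTS:
            continue  # assumption: ignore words outside the vocabulary
        count = WORD_COUNTS[word][label]
        # Add-one smoothing for a present/absent word feature.
        score *= Fraction(count + 1, N_EMAILS[label] + 2)
    return score

def posterior_spam(words):
    """Normalized probability that the message is spam."""
    spam = class_score(words, "spam")
    ham = class_score(words, "ham")
    return spam / (spam + ham)

message = "review us".split()
print(posterior_spam(message))  # 32/59
```

With these assumptions, "review us" comes out only slightly more likely spam than ham (32/59), while a message containing "account" still gets a nonzero ham score thanks to the smoothing.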