SPAM CLASSIFICATION
OUTLINE
• INTRODUCTION
• DATA EXPLORATION
• DATA WRANGLING
• MODEL DEVELOPMENT, TESTING & EVALUATION
• CONCLUSIONS
• REFERENCES
INTRODUCTION
This project uses R and machine learning to classify text as spam or non-spam (ham).
Image spam is handled with Optical Character Recognition (OCR): the text is extracted
from each image and then passed to the text classifier, which labels the image as spam
or non-spam. A Naive Bayes classifier is used for the text classification.
DATASET
The SMS Spam Collection is a set of SMS messages collected for SMS spam research. It contains
5,573 rows of messages labelled as spam or non-spam (ham), one message per line. Each line has
two columns: v1 contains the label (ham or spam) and v2 contains the raw text. One of the sources
the collection draws on, the SMS Spam Corpus v.0.1 Big, has 1,002 ham messages and 322 spam messages.

For spam image classification we used a dataset of spam and ham images,
each with spam or ham text embedded in it.
LIBRARIES
• tm: R package for text mining, e.g. tm_map() and content_transformer()
• wordcloud: R package for generating word clouds
• RColorBrewer: R package providing color palettes (used in the word cloud)
• SnowballC: R package used for word stemming
DATA EXPLORATION
DATA WRANGLING
Corpus creation: a corpus is a large, unstructured collection of texts used for statistical analysis.
We create a corpus for the main dataset.
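The corpus creation step could look roughly like the sketch below. The three messages are hypothetical stand-ins for the real data; only the column names v1 and v2 come from the dataset description, and the choice of VCorpus() with VectorSource() is an assumption about how the corpus was built.

```r
library(tm)

# Hypothetical stand-in for the SMS data: v1 is the label, v2 the raw text.
sms <- data.frame(
  v1 = c("ham", "spam", "ham"),
  v2 = c("Are we still meeting at 5?",
         "WINNER!! Claim your FREE prize now, call 0900-123",
         "Ok, see you later."),
  stringsAsFactors = FALSE
)

# Each element of v2 becomes one document in the corpus.
corpus <- VCorpus(VectorSource(sms$v2))

length(corpus)             # number of documents
as.character(corpus[[2]])  # raw text of the second message
```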

Corpus cleaning

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus2 <- tm_map(corpus2, removeNumbers)
corpus2 <- tm_map(corpus2, stemDocument)

#doc = tm_map(doc, toSpace, "*")
doc = tm_map(doc, toSpace, ",")
doc = tm_map(doc, toSpace, "!")
#doc = tm_map(doc, toSpace, ".")
#doc = tm_map(doc, toSpace, "-")
doc = tm_map(doc, toSpace, "\\|")
doc = tm_map(doc, toSpace, "/")
doc = tm_map(doc, toSpace, "#")
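The cleaning calls use a helper named toSpace that is not defined in the slides. The sketch below assumes it is the usual content_transformer() wrapper around gsub(), and runs the whole pipeline on one hypothetical message. The final stripWhitespace step is an addition, not part of the original code.

```r
library(tm)

# Assumed definition: replace every match of `pattern` with a space.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Hypothetical message used only for illustration.
corpus <- VCorpus(VectorSource(
  "WINNER!! Claim your FREE prize/gift now, call 0900 123"
))

corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)     # stemming relies on SnowballC
corpus <- tm_map(corpus, stripWhitespace)  # added step: collapse repeated spaces

cleaned <- as.character(corpus[[1]])
cleaned
```

Replacing "/" with a space before removePunctuation keeps "prize" and "gift" as separate words instead of fusing them into one token.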

 
BEFORE CLEANING (with punctuation, symbols and numbers)

AFTER CLEANING
DOCUMENT TERM MATRIX
A Document Term Matrix (DTM) of a corpus has one row per document and one column per term;
each cell holds the number of times the term occurs in the document.
The dimension of the matrix is the number of documents x the number of words in the corpus (5 x 8155).

• This DTM has 80% sparsity, i.e., 80% of its entries are zero (because the
majority of words appear in only a few documents).
• The inspect command displays terms 1000 to 1005 in the first two rows of
the DTM.
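As a minimal sketch of these ideas, the example below builds a DTM from three hypothetical documents, inspects a small slice, and computes the sparsity as the share of zero cells. The documents and the slice indices are illustrative, not the ones used in the project.

```r
library(tm)

# Hypothetical documents used only for illustration.
docs <- c("free prize call now", "call me later", "free free entry")
corpus <- VCorpus(VectorSource(docs))

dtm <- DocumentTermMatrix(corpus)
dim(dtm)                 # number of documents x number of terms

# Show a few terms for the first two documents.
inspect(dtm[1:2, 1:3])

# Sparsity is the fraction of cells in the matrix that are zero.
m <- as.matrix(dtm)
sparsity <- mean(m == 0)
sparsity
```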
TERM DOCUMENT MATRIX
 
A term document matrix represents the words in the text as a table (or matrix) of numbers.
It is the transpose of the document term matrix: the rows represent the words from the text that are
used in the analysis, and the columns represent the text responses (documents) to be analysed.

m <- as.matrix(tdm)
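A short sketch of the conversion, assuming tdm was built with tm's TermDocumentMatrix(). The three documents are hypothetical; in the resulting matrix each row is a term, so row sums give the total count per term.

```r
library(tm)

# Hypothetical documents used only for illustration.
docs <- c("free prize call now", "call me later", "free free entry")
corpus <- VCorpus(VectorSource(docs))

tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
dim(m)                   # number of terms x number of documents

# Total occurrences of each term across all documents.
term_totals <- sort(rowSums(m), decreasing = TRUE)
head(term_totals)
```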
TO FIND FREQUENT TERMS

Length of the frequency vector (total number of terms)

Least frequently occurring terms

Frequency of each word in the corpus (for each column, the sum over all rows, i.e. documents)

Most frequently occurring terms (in decreasing order)

Build a Document Term Matrix without the least frequent words (keeping words that occur in 3 to 27 documents
and are 4 to 20 characters long)
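The frequency steps above can be sketched as follows. The four documents are hypothetical, and the assumption is that the reduced matrix was built with the wordLengths and bounds control options of DocumentTermMatrix(), which match the limits described.

```r
library(tm)

# Hypothetical documents used only for illustration.
docs <- c("free prize call", "free entry call", "free cash later", "call later")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Frequency of each term: column sums of the DTM.
freq <- colSums(as.matrix(dtm))
length(freq)                     # total number of terms

ord <- order(freq)
freq[head(ord)]                  # least frequent terms
sort(freq, decreasing = TRUE)    # most frequent terms first

# Terms occurring at least 3 times in total.
findFreqTerms(dtm, lowfreq = 3)

# Keep terms of 4 to 20 characters that appear in 3 to 27 documents.
dtm_small <- DocumentTermMatrix(corpus, control = list(
  wordLengths = c(4, 20),
  bounds = list(global = c(3, 27))
))
Terms(dtm_small)
```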
WORDCLOUD

100 most frequently occurring words
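A word cloud like this one could be drawn along the lines of the sketch below. The frequency vector is hypothetical, and the palette and other plotting options are assumptions rather than the settings used for the slide.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical term frequencies; in the project these would come from the DTM.
freq <- c(free = 30, call = 25, prize = 18, text = 15, claim = 12,
          mobile = 10, reply = 8, urgent = 6, cash = 5, later = 4)

set.seed(42)  # word placement is random, so fix the seed for a repeatable layout
wordcloud(words = names(freq), freq = freq,
          min.freq = 1,
          max.words = 100,          # plot at most the 100 most frequent words
          random.order = FALSE,     # most frequent words in the centre
          colors = brewer.pal(8, "Dark2"))
```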


SPAM IMAGE CLASSIFICATION
Text Extraction
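library(dplyr)
library(magick)  # image_read(), image_info(), image_ocr(); OCR needs the tesseract package
library(imager)  # load.image(), display()

setwd("D:/PGDBD 2SEM/project2/test")
path <- "D:/PGDBD 2SEM/project2/test/"

# Get the list of all images
allFiles <- list.files(path, pattern = "\\.jpg$", full.names = TRUE)

# Get the data frame with info about each image and save it
allInfo <- image_info(image_read(allFiles))
write.csv(allInfo, "allImageInfo.csv")

# Load and display the images
all_im <- lapply(allFiles, load.image)
all_im
display(all_im)

# Read one image and extract its text with OCR
read_ocr_png <- function(file) {
  img <- image_read(file)
  image_ocr(img)
}

text_list <- lapply(allFiles, read_ocr_png)
text_list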

CONVERT TEXT FROM IMAGES INTO DATAFRAME


Similarly, we convert the data frame into a corpus, perform all the cleaning operations on it, and calculate
the DTM for the corpus.
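A minimal sketch of the conversion, using base R only. The file names and OCR strings are hypothetical stand-ins for allFiles and text_list, and the column names file and text are assumptions.

```r
# Hypothetical stand-ins for the file list and the OCR output.
allFiles  <- c("img/spam_01.jpg", "img/ham_01.jpg")
text_list <- list("WIN a FREE phone!\nCall now\n", "Lunch at 1pm?\n")

# One row per image: collapse line breaks and trim surrounding whitespace.
image_text <- data.frame(
  file = basename(allFiles),
  text = trimws(gsub("\\s+", " ", unlist(text_list))),
  stringsAsFactors = FALSE
)

image_text
```

The text column can then be passed to VCorpus(VectorSource(image_text$text)) and cleaned with the same tm_map() steps as the SMS corpus.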
MODEL DEVELOPMENT
MODEL TESTING & EVALUATION
confusionMatrix(sms_test_pred, sms_test_labels, dnn = c("predicted", "actual"))
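The model-building code is not shown in the slides. The sketch below assumes the classifier is e1071's naiveBayes() trained on word-presence features, and uses tiny hypothetical data. The names sms_train, sms_train_labels, sms_test and the helper yn are illustrative. The evaluation uses base R's table() as a stand-in for the confusionMatrix() call above, which comes from the caret package.

```r
library(e1071)

# Helper (hypothetical): turn word counts into "No"/"Yes" presence factors.
yn <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))

# Hypothetical training data: one column per word, one row per message.
sms_train <- data.frame(
  free  = yn(c(1, 1, 1, 0, 0, 0)),
  call  = yn(c(1, 1, 0, 0, 1, 0)),
  later = yn(c(0, 0, 0, 1, 1, 1))
)
sms_train_labels <- factor(c("spam", "spam", "spam", "ham", "ham", "ham"),
                           levels = c("ham", "spam"))

# Laplace smoothing avoids zero probabilities for unseen word/class pairs.
sms_classifier <- naiveBayes(sms_train, sms_train_labels, laplace = 1)

# Hypothetical test data with known labels.
sms_test <- data.frame(
  free  = yn(c(1, 0)),
  call  = yn(c(1, 0)),
  later = yn(c(0, 1))
)
sms_test_labels <- factor(c("spam", "ham"), levels = c("ham", "spam"))

sms_test_pred <- predict(sms_classifier, sms_test)

# Confusion table and overall accuracy.
table(predicted = sms_test_pred, actual = sms_test_labels)
accuracy <- mean(sms_test_pred == sms_test_labels)
accuracy
```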
CONCLUSION
The project shows that OCR is one way to extract the text from an image and classify it as spam or
non-spam, so that the image can be classified accordingly. Similarly, we can
classify images as spam or non-spam by identifying features of each image in the dataset and
building a model on those features. The project uses the Naïve Bayes
classification algorithm, which assumes that all predictor variables are independent of each other
given the class and is fast to train on large datasets.
REFERENCES
[1] Support Vector Machines for Image Spam Analysis - San Jose State University, SJSU ScholarWorks

[2] Image Spam Detection Using Machine Learning and Natural Language Processing - Journal of Southwest Jiaotong University

[3] rpubs.com
