CLASSIFICATION
OUTLINE
• INTRODUCTION
• DATA EXPLORATION
• DATA WRANGLING
• MODEL DEVELOPMENT, TESTING & EVALUATION
• CONCLUSIONS
• REFERENCES
INTRODUCTION
This project uses R and machine learning to classify text as spam or non-spam (ham).
Image spam classification relies on Optical Character Recognition (OCR): the text is
extracted from each image and then passed to the text classifier, which labels the
image as spam or non-spam. We use Naive Bayes as the classification and modelling
technique for labelling text as spam or non-spam.
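To make the Naive Bayes idea concrete, here is a minimal base-R sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is an illustration only, not the project's actual model code: the helper names `tokenize`, `train_nb` and `predict_nb` and the four training messages are invented for this example.

```r
# Illustrative multinomial Naive Bayes in base R (hypothetical helpers).
tokenize <- function(text) {
  # Lower-case, replace non-letters with spaces, split on whitespace
  words <- strsplit(tolower(gsub("[^A-Za-z ]", " ", text)), " +")[[1]]
  words[nzchar(words)]
}

train_nb <- function(texts, labels) {
  tokens  <- lapply(texts, tokenize)
  vocab   <- sort(unique(unlist(tokens)))
  classes <- sort(unique(labels))
  # Word counts per class over the shared vocabulary (rows = words)
  counts <- sapply(classes, function(cl) {
    table(factor(unlist(tokens[labels == cl]), levels = vocab))
  })
  rownames(counts) <- vocab
  # log P(word | class) with add-one smoothing
  log_lik <- log(sweep(counts + 1, 2, colSums(counts) + length(vocab), "/"))
  # log P(class), in the same order as `classes`
  log_prior <- log(as.numeric(table(factor(labels, levels = classes))) /
                     length(labels))
  list(vocab = vocab, classes = classes,
       log_lik = log_lik, log_prior = log_prior)
}

predict_nb <- function(model, text) {
  words <- tokenize(text)
  words <- words[words %in% model$vocab]  # ignore words never seen in training
  scores <- model$log_prior +
    colSums(model$log_lik[words, , drop = FALSE])
  model$classes[which.max(scores)]
}

# Invented toy training data
texts  <- c("win a free prize now", "free cash call now",
            "are we meeting for lunch", "see you at home tonight")
labels <- c("spam", "spam", "ham", "ham")

model <- train_nb(texts, labels)
pred_spam <- predict_nb(model, "free prize call")
pred_ham  <- predict_nb(model, "lunch at home")
```

The classifier picks the class with the highest sum of log prior and per-word log likelihoods; smoothing keeps a word unseen in one class from forcing that class's probability to zero.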
DATASET
The SMS Spam Collection is a set of SMS messages collected for SMS spam research. It contains
5,573 rows of messages labelled as spam or non-spam (ham). The files contain one message per line. Each line is
composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text. One of the sources of
the collection, the SMS Spam Corpus v.0.1 Big, has 1,002 SMS ham messages and 322 spam messages.
For spam image classification we used a dataset of spam and ham images,
each with spam or ham text embedded in it.
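As a rough sketch of how a file with this two-column layout could be loaded and inspected in R, the example below reads a tiny inline stand-in for the CSV. The three rows are invented for illustration; with the real file, a path would be passed to `read.csv()` instead of `text =`.

```r
# Tiny inline stand-in for the real CSV; these rows are invented.
csv_text <- 'v1,v2
ham,"Ok, see you at 5"
spam,"WINNER!! Claim your free prize now"
ham,"Are we still on for lunch?"'

sms <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Treat the label column as a two-level factor
sms$v1 <- factor(sms$v1, levels = c("ham", "spam"))

str(sms)                                  # column types and a preview
label_counts <- table(sms$v1)             # messages per class
label_share  <- prop.table(label_counts)  # class proportions
print(label_counts)
```

Checking the class counts early is useful because spam datasets are usually imbalanced, which affects how accuracy should be read later.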
LIBRARIES
• tm : R package for text mining, e.g. tm_map(), content_transformer()
• wordcloud : R package for generating word clouds
• RColorBrewer : R package providing color palettes (used in wordcloud)
• SnowballC : R package used for word stemming
DATA EXPLORATION
DATA WRANGLING
Corpus creation (a corpus is a large, unstructured collection of texts used for statistical analysis and
other operations). We create a corpus from the main dataset.
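A minimal sketch of corpus creation with the tm package, assuming the message text is available as a character vector. The two messages are invented; in the project this vector would be the v2 column of the dataset.

```r
library(tm)

# Invented example messages standing in for the v2 column
messages <- c("Free entry!! Text WIN to claim your prize",
              "Ok, I will call you after lunch")

# VectorSource() treats each vector element as one document
corpus <- VCorpus(VectorSource(messages))

n_docs    <- length(corpus)           # number of documents
first_doc <- as.character(corpus[[1]])  # raw text of the first document
print(corpus)
```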
Corpus cleaning
#doc = tm_map(doc, toSpace, "*")
corpus <- tm_map(corpus, content_transformer(tolower))
BEFORE CLEANING (with punctuation, symbols and numbers)
AFTER CLEANING
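The cleaning step can be sketched as a chain of tm_map() calls. This is one plausible pipeline, not necessarily the exact one used in the project: only the tolower step and the commented-out toSpace call appear in the slides. The `toSpace` helper below is an assumed definition (matching the pattern literally via `fixed = TRUE`), and the sample message is invented.

```r
library(tm)
library(SnowballC)  # required by stemDocument()

corpus <- VCorpus(VectorSource(
  "WINNER!! Call 0800-123 now & claim your FREE prizes*"))

# Assumed helper: replace a literal pattern with a space
toSpace <- content_transformer(
  function(x, pattern) gsub(pattern, " ", x, fixed = TRUE))

before <- as.character(corpus[[1]])

corpus <- tm_map(corpus, toSpace, "*")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

after <- as.character(corpus[[1]])

cat("Before:", before, "\n")
cat("After: ", after, "\n")
```

Lower-casing comes before stop-word removal because the English stop-word list in tm is lower-case.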
DOCUMENT TERM MATRIX
A Document Term Matrix (DTM) of a corpus is a matrix with one row per document and one column per term;
each entry is the count of that term in that document.
The dimension of the matrix is the number of documents x the number
of words in the corpus (5x 8155)
m <- as.matrix(tdm)
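A small sketch of building such a matrix with tm, using three invented documents. Here `DocumentTermMatrix()` puts documents in rows; `TermDocumentMatrix()` would give the transpose, which may be what the `tdm` name above refers to. The `wordLengths` control is set explicitly so that short words are kept in this toy example.

```r
library(tm)

# Invented toy documents
docs   <- c("free prize call now", "call me now", "lunch at noon")
corpus <- VCorpus(VectorSource(docs))

# Keep words of any length so short tokens are not dropped
dtm <- DocumentTermMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))

dtm_dim <- dim(dtm)    # number of documents x number of terms
m <- as.matrix(dtm)    # dense matrix of term counts

call_total <- sum(m[, "call"])  # occurrences of "call" across all documents
print(dtm_dim)
print(m)
```

Converting to a dense matrix is fine for a toy corpus; for thousands of messages the sparse DTM is far smaller in memory.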
TO FIND FREQUENT TERMS
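One way to list frequent terms from a document term matrix, shown as a sketch on three invented documents. `findFreqTerms()` from tm returns terms whose total count reaches a threshold; summing the columns of the dense matrix gives the full frequency table, which is also the usual input for a word cloud.

```r
library(tm)

# Invented toy documents
docs <- c("free prize call now", "call me now", "free lunch at noon")
dtm  <- DocumentTermMatrix(VCorpus(VectorSource(docs)),
                           control = list(wordLengths = c(1, Inf)))

# Terms that occur at least twice across the corpus
frequent <- findFreqTerms(dtm, lowfreq = 2)

# Full frequency table, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

print(frequent)
print(head(freq, 5))

# Optional, if wordcloud and RColorBrewer are installed:
# wordcloud::wordcloud(names(freq), freq, min.freq = 1,
#                      colors = RColorBrewer::brewer.pal(8, "Dark2"))
```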
image_ocr(img)
text_list
write.csv(allFilesInfo, "allImageInfo.csv")
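The OCR fragments above might fit together roughly as follows. This is a sketch under assumptions: the folder name `images`, the file-extension filter and the column names are hypothetical, and `image_ocr()` from the magick package needs the tesseract package installed. The extracted text could then go through the same cleaning and classification steps as the SMS messages.

```r
library(magick)  # image_read(), image_ocr(); OCR requires the tesseract package

# Hypothetical folder holding the spam and ham images
image_dir <- "images"
files <- list.files(image_dir, pattern = "\\.(png|jpe?g)$",
                    full.names = TRUE, ignore.case = TRUE)

# Extract the text of each image as a single string
text_list <- vapply(files, function(path) {
  img <- image_read(path)
  paste(image_ocr(img), collapse = " ")
}, character(1))

# One row per image: file name and extracted text
allFilesInfo <- data.frame(file = basename(files),
                           text = unname(text_list),
                           stringsAsFactors = FALSE)

write.csv(allFilesInfo, "allImageInfo.csv", row.names = FALSE)
```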
[2] Image Spam Detection Using Machine Learning And Natural Language Processing - Journal Of
[3] rpubs.com