TEAM MEMBER: J. VIJAYAN
Phase 5: Document Submission
Introduction
“Building a Smarter AI-Powered Spam Classifier” is a comprehensive guide that walks
you through the process of creating an intelligent system capable of identifying and
filtering out spam emails. This book delves into the intricacies of machine learning
algorithms, natural language processing techniques, and data analysis methods used
to design and implement such a classifier.
By reading this book, you will gain a deep understanding of how AI can be harnessed
to combat spam. You’ll learn about the different types of spam, how they are
created, and how they can be detected. You’ll also get hands-on experience building
your own spam classifier using real-world datasets.
Process:
Email spam detection is a common problem in the field of machine learning. In this context,
we can use machine learning algorithms to classify emails into spam and non-spam
categories. One such algorithm is the Support Vector Machine (SVM) algorithm.
The SMS Spam Collection is a set of SMS messages that have been collected and tagged for SMS spam research. It contains 5,574 SMS messages in English, each tagged as ham (legitimate) or spam.
The data was obtained from UCI's Machine Learning Repository; alternatively, I have also uploaded the dataset and the completed Jupyter notebook to my GitHub repo.
Data Collection:
Gather a large and diverse dataset of emails or messages, labeled as spam or non-spam
(ham). This dataset is crucial for training your AI model.
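A minimal sketch of loading this dataset with pandas (the file name SMSSpamCollection and the column names label/text are assumptions; the UCI distribution is a tab-separated file with no header row):

import pandas as pd

# Load the SMS Spam Collection: one message per line, tab-separated
# (assumed file name and column names)
data = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'text'])
data.head()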
Data Preprocessing:
Clean and preprocess the data. This may involve tasks like removing special
characters, stemming, and tokenization.
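For example, a quick sketch of stemming and tokenization with NLTK (the sample sentence is illustrative only):

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')

stemmer = PorterStemmer()
tokens = word_tokenize("Winner! You have won a free prize, call now")
print([stemmer.stem(t) for t in tokens])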
Feature Extraction: Extract relevant features from the preprocessed data. Common
techniques include Bag-of-Words, TF-IDF (Term Frequency-Inverse Document Frequency),
or word embeddings like Word2Vec.
Training the Model: Train your chosen model using the preprocessed data. Use a portion
of the dataset for training and another portion for validation to fine-tune the model.
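A minimal sketch of such a split with sklearn, assuming a feature matrix X and labels y have already been built (the 80/20 split is one common choice, not the project's stated setting):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hold out 20% of the data for validation (X and y are assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)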
Evaluation:
Evaluate the model’s performance using metrics like accuracy, precision, recall, and F1-
score. This helps in understanding how well your model is performing.
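These metrics are all available in sklearn; a sketch assuming a fitted classifier clf and the held-out split from above:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

pred = clf.predict(X_test)
print("Accuracy :", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall   :", recall_score(y_test, pred))
print("F1-score :", f1_score(y_test, pred))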
Fine-Tuning:
Based on the evaluation results, fine-tune the model. This could involve adjusting
hyperparameters, trying different algorithms, or exploring advanced techniques like
ensemble methods.
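As one way to tune hyperparameters, a sketch using GridSearchCV (the grid values are illustrative assumptions, not the project's actual settings):

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# 5-fold cross-validated search over the Naive Bayes smoothing parameter
grid = GridSearchCV(MultinomialNB(), param_grid={'alpha': [0.1, 0.2, 0.5, 1.0]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)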
data['label'].value_counts()
# OUTPUT
ham 4825
spam 747
Name: label, dtype: int64
Here are some steps to build a word cloud to detect spam messages:
Pre-process the Data: Pre-processing involves cleaning and transforming the raw text data into a format that can be used by machine learning algorithms. This includes removing stop words, stemming, and lemmatization.
Tokenize the Data: Tokenization is the process of breaking down the text into individual words or phrases, known as tokens. This step is necessary for building a word cloud.
Create a Word Frequency Distribution: Count the frequency of each token in the text data and create a word frequency distribution.
Visualize the Word Cloud: Use a Python library such as wordcloud to create a word cloud visualization of the most frequent words in the text data.
ham_words = ''
spam_words = ''

# Creating a corpus of spam messages
for val in data[data['label'] == 'spam'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        spam_words = spam_words + words + ' '

# Creating a corpus of ham messages in the same way
for val in data[data['label'] == 'ham'].text:
    text = val.lower()
    tokens = nltk.word_tokenize(text)
    for words in tokens:
        ham_words = ham_words + words + ' '
Let's use the corpora above to create the spam word cloud and the ham word cloud. From the spam word cloud, we can see that "free" is the word most often used in spam.
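A minimal sketch of the visualization step, assuming the wordcloud and matplotlib packages are installed:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate and display the spam word cloud from the corpus built above
spam_wordcloud = WordCloud(width=600, height=400).generate(spam_words)
plt.figure(figsize=(10, 8))
plt.imshow(spam_wordcloud)
plt.axis('off')
plt.show()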
Now, we can convert the ham and spam labels into 0 and 1 respectively so that the machine can understand.
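A one-line mapping does this, assuming the label column holds the strings 'ham' and 'spam':

# Encode the labels: ham -> 0, spam -> 1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})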
To remove stop words using NLTK, you can use the stopwords module, which contains a list
of common stop words in various languages. Here’s an example code snippet that
demonstrates how to remove stop words from a sentence:
Punctuation and stop words do not contribute anything to our model, so we have to remove them. Using the NLTK library, we can do this easily.
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

def text_process(text):
    # Remove punctuation, then remove English stop words
    text = ''.join(c for c in text if c not in string.punctuation)
    text = [word for word in text.split() if word.lower() not in stopwords.words('english')]
    return " ".join(text)

data['text'] = data['text'].apply(text_process)
data.head()
Now, create a data frame from the processed data before moving to the next step.
text = pd.DataFrame(data['text'])
label = pd.DataFrame(data['label'])
TF-IDF is better than count vectorizers because it not only captures how frequently words appear in the corpus but also weights how important each word is. We can then remove the words that are less important for analysis, making model building less complex by reducing the input dimensions.
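To see the difference on a toy example (the two-message corpus below is purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy = ["free prize call now", "call me when you are free tonight"]

# Raw counts treat every word equally...
print(CountVectorizer().fit_transform(toy).toarray())
# ...while TF-IDF down-weights words shared by many documents ('call', 'free')
print(TfidfVectorizer().fit_transform(toy).toarray())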
from collections import Counter

total_counts = Counter()
for i in range(len(text)):
    for word in text.values[i][0].split(" "):
        total_counts[word] += 1
print("Total words in data set: ", len(total_counts))
# OUTPUT
Total words in data set: 11305
# Sorting in decreasing order (word with highest frequency appears first)
vocab = sorted(total_counts, key=total_counts.get, reverse=True)
print(vocab[:60])
# OUTPUT
['u', '2', 'call', 'U', 'get', 'Im', 'ur', '4', 'ltgt', 'know', 'go',
'like', 'dont', 'come', 'got', 'time', 'day', 'want', 'Ill', 'lor',
'Call', 'home', 'send', 'going', 'one', 'need', 'Ok', 'good', 'love',
'back', 'n', 'still', 'text', 'im', 'later', 'see', 'da', 'ok',
'think', 'Ì', 'free', 'FREE', 'r', 'today', 'Sorry', 'week', 'phone',
'mobile', 'cant', 'tell', 'take', 'much', 'night', 'way', 'Hey',
'reply', 'work', 'make', 'give', 'new']
# Mapping from words to index
vocab_size = len(vocab)
word2idx = {}
for i, word in enumerate(vocab):
    word2idx[word] = i
import numpy as np

# Text to vector: a bag-of-words count vector over the vocabulary
def text_to_vector(text):
    word_vector = np.zeros(vocab_size)
    for word in text.split(" "):
        if word2idx.get(word) is None:
            continue
        word_vector[word2idx.get(word)] += 1
    return word_vector
# Convert all messages to vectors
word_vectors = np.zeros((len(text), len(vocab)), dtype=np.int_)
for i, (_, text_) in enumerate(text.iterrows()):
    word_vectors[i] = text_to_vector(text_.iloc[0])
word_vectors.shape
# OUTPUT
(5572, 11305)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data['text'])
vectors.shape
# OUTPUT
(5572, 9376)
# features = word_vectors  # alternatively, use the hand-built count vectors
features = vectors
Classifiers used:
1. Spam classifier using Logistic Regression
2. Spam classifier using Support Vector Machine (SVM)
3. Spam classifier using Naive Bayes
4. Spam classifier using Decision Tree
5. Spam classifier using K-Nearest Neighbors (KNN)
6. Spam classifier using Random Forest
We will make use of the sklearn library. This amazing library implements all of the above algorithms; we just have to import them, and it is as easy as that. No need to worry about all the maths and statistics behind them.
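The training loop below refers to a clfs dictionary of models and a train helper that are not shown in this document. A minimal sketch of that setup, using default hyperparameters as an assumption (the dictionary keys match the output below):

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Split the TF-IDF features and the 0/1 labels (80/20 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    features, data['label'], test_size=0.2, random_state=111)

clfs = {
    'SVC': SVC(),
    'KN': KNeighborsClassifier(),
    'NB': MultinomialNB(),
    'DT': DecisionTreeClassifier(),
    'LR': LogisticRegression(),
    'RF': RandomForestClassifier(),
}

def train(clf, features, targets):
    clf.fit(features, targets)

# Alias used later for single-message predictions; fitted in the loop below
mnb = clfs['NB']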
def predict(clf, features):
    return clf.predict(features)

pred_scores_word_vectors = []
for k, v in clfs.items():
    train(v, X_train, y_train)
    pred = predict(v, X_test)
    pred_scores_word_vectors.append((k, [accuracy_score(y_test, pred)]))
# OUTPUT
[('SVC', [0.9784688995215312]),
('KN', [0.9330143540669856]),
('NB', [0.9880382775119617]),
('DT', [0.9605263157894737]),
('LR', [0.9533492822966507]),
('RF', [0.9796650717703349])]
Model predictions
# Write a function to report whether a message is spam or not
def find(x):
    if x == 1:
        print("Message is SPAM")
    else:
        print("Message is NOT Spam")

newtext = ["Free entry"]
integers = vectorizer.transform(newtext)
x = mnb.predict(integers)
find(x)
# OUTPUT
Message is SPAM
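The same function works on a legitimate message (the sample text is illustrative):

newtext2 = ["Are we still meeting for lunch today?"]
find(mnb.predict(vectorizer.transform(newtext2)))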
Conclusion:
The number of people using mobile devices increasing day by day. SMS (short
messageservice) is a text message service available in smartphones as well as
basic phones. So,the traffic of SMS increased drastically. The spam messages
alsoincreased. Thespammers try to send spam messages for their financial or
business benefits like marketgrowth, lottery ticket information, credit card
information, etc. So, spam classificationhas special attention. In this paper, we
applied various machine learning and deeplearning techniques for SMS spam
detection. we used a dataset from UCI and build aspam detection model. Our
experimental results have shown that our LSTM modeloutperforms previous models
in spam detection with an accuracy of 98.5%. We usedpython for all
implementations.