Professional Documents
Culture Documents
Sms Spam Detection
Sms Spam Detection
ipynb - Colaboratory
# importing Libraries
/Users/Desh/Downloads
spam_test_dataframe = pd.read_csv(os.getcwd() + '/sms_spam.csv',names= ['label', 'feature']) # read csv as we have a csv file to read
label feature
0 type text
As we can see that first row contains header, we need to remove them first.
spam_test_dataframe.dropna()
spam_test_dataframe=spam_test_dataframe.iloc[1:]
spam_test_dataframe.isnull().values.any()
False
spam_test_dataframe.head()
label feature
https://colab.research.google.com/drive/1_17XjEbpMmjFruDPGXgjIHmMVyAx5q2S#scrollTo=3XbLc_sCVpep&printMode=true 1/7
7/14/23, 10:29 PM SMS SPAM DETECTION.ipynb - Colaboratory
Data Analysis
First we will analyse the data which we have received and then we will do machine learning on it.
label feature
unique 2 5156
freq 4812 30
feature
label
As suspected length of message could bre really useful to identify the spam or a ham sms.
Let us add another column called length of feature which will have how much does the message length is.
spam_test_dataframe['length'] = spam_test_dataframe['feature'].apply(len)
spam_test_dataframe.head()
5 spam okmail: Dear Dave this is your final notice to... 161
Data Visualization
Lets Analyse the data before we do some machine learning
spam_test_dataframe['length'].plot(bins=100, kind='hist')
<matplotlib.axes._subplots.AxesSubplot at 0x1a1f01c0b8>
spam_test_dataframe.length.describe()
count 5559.000000
mean 79.781436
std 59.105497
min 2.000000
25% 35.000000
50% 61.000000
75% 121.000000
https://colab.research.google.com/drive/1_17XjEbpMmjFruDPGXgjIHmMVyAx5q2S#scrollTo=3XbLc_sCVpep&printMode=true 2/7
7/14/23, 10:29 PM SMS SPAM DETECTION.ipynb - Colaboratory
max 910.000000
Name: length, dtype: float64
spam_test_dataframe[spam_test_dataframe['length'] == 910]['feature'].iloc[0]
"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing
which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when
my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my
happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the
craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole
planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life
will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a
lot..will tell later.."
So this SMS message was sent by one person to other personally so it is ham. But this does not help much in Ham Spam identification
Our data is text data so first it should be in a vector format which is then input to machine learning model. In this section we'll convert the raw
messages (sequence of characters) into vectors (sequences of numbers).
Step1: Do some preprocessing like removing punctuation, stop words etc. Step2: Do some advance text processing like converting to bag of
words, N-gram etc. Step3: Machine Learning model fit adn transform Step4: Model accuracy check. Let's first start with with step 1 and then
rest will follow.
import string
from nltk.corpus import stopwords
# text pre-processing
spam_test_dataframe['feature'] = spam_test_dataframe['feature'].str.replace('[^\w\s]','')
spam_test_dataframe['feature'] = spam_test_dataframe['feature'].apply(lambda x: " ".join(x.lower() for x in x.split()))
stop = stopwords.words('english')
spam_test_dataframe['feature'] = spam_test_dataframe['feature'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
Lets do something even better beside above techniques, lets shorten the terms to their stem form.
https://colab.research.google.com/drive/1_17XjEbpMmjFruDPGXgjIHmMVyAx5q2S#scrollTo=3XbLc_sCVpep&printMode=true 3/7
7/14/23, 10:29 PM SMS SPAM DETECTION.ipynb - Colaboratory
spam_test_dataframe['feature'] = spam_test_dataframe['feature'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
Each vector will have as many dimensions as there are unique words in the SMS corpus. We will first use SciKit Learn's CountVectorizer. This
model will convert a collection of text documents to a matrix of token counts.
We can imagine this as a 2-Dimensional matrix. Where the 1-dimension is the entire vocabulary (1 row per word) and the other dimension are
the actual documents, in this case a column per text message.
For example:
Message 1 Message 2 ... Message N
Word 1 Count 0 1 ... 0
Word 2 Count 0 0 ... 0
... 1 2 ... 0
Word N Count 0 1 ... 1
Since there are so many messages, we can expect a lot of zero counts for the presence of that word in that document. Because of this, SciKit
Learn will output a Sparse Matrix.
we will use pipelines to make the steps short. The pipeline will do the vectorization, Term frequcz transformation and model fitting together.
After the counting, the term weighting and normalization can be done with TF-IDF, using scikit-learn's TfidfTransformer .
So what is TF-IDF?
TF-IDF stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text
mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance
increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance
given a user query.
One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are
variants of this simple model.
Typically, the tf-idf weight is composed by two terms: the first computes the normalized Term Frequency (TF), aka. the number of times a word
appears in a document, divided by the total number of words in that document; the second term is the Inverse Document Frequency (IDF),
computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the specific term
appears.
TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible
that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document
length (aka. the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important.
However it is known that certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus we need to weigh
down the frequent terms while scale up the rare ones, by computing the following:
Example:
Consider a document containing 100 words wherein the word cat appears 3 times.
The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one
thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the
product of these quantities: 0.03 * 4 = 0.12.
Let's go ahead and see how we can do this in SciKit Learn: To transform the entire bag-of-words corpus into TF-IDF corpus at once:
https://colab.research.google.com/drive/1_17XjEbpMmjFruDPGXgjIHmMVyAx5q2S#scrollTo=3XbLc_sCVpep&printMode=true 4/7
7/14/23, 10:29 PM SMS SPAM DETECTION.ipynb - Colaboratory
The test size is 20% of the entire dataset (1112 messages out of total 5559), and the training is the rest (4447 out of 5559). Note the default
split would have been 30/70.
Training a model
With messages represented as vectors, we can finally train our spam/ham classifier. Now we can actually use almost any sort of classification
algorithms. For a variety of reasons, the Naive Bayes classifier algorithm is a good choice.
Using scikit-learn here, choosing the Naive Bayes, Decision Tree, Random Forest, Support Vector Machine classifiers to start with: In the end we
will compare the accuracy of each model.
Naive Bayes
# Pipelining
text_clf = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()), ('clf', MultinomialNB()),])
text_clf = text_clf.fit(X_train, y_train)
# using GridSearch CV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False), 'clf__alpha': (1e-2, 1e-3),}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)
gs_clf.best_score_
gs_clf.best_params_
predicted_nb = gs_clf.predict(X_test)
print(predicted_nb)
Decision Tree
predicted_dt = dt.predict(X_test)
print(predicted_dt)
Random Forest
# Pipelining
rf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
('clf-rf', RandomForestClassifier(n_estimators = 100, max_depth=5, random_state = 42)),])
_ = rf.fit(X_train, y_train)
predicted_rf = rf.predict(X_test)
print(predicted_rf)
# using SVM
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, random_state=42)),])
_ = text_clf_svm.fit(X_train, y_train)
https://colab.research.google.com/drive/1_17XjEbpMmjFruDPGXgjIHmMVyAx5q2S#scrollTo=3XbLc_sCVpep&printMode=true 5/7
7/14/23, 10:29 PM SMS SPAM DETECTION.ipynb - Colaboratory
predicted_svm = text_clf_svm.predict(X_test)
print(predicted_svm)
Classification Report
Lets develop the classification report for all above models
Accuracy Score
https://colab.research.google.com/drive/1_17XjEbpMmjFruDPGXgjIHmMVyAx5q2S#scrollTo=3XbLc_sCVpep&printMode=true 6/7
7/14/23, 10:29 PM SMS SPAM DETECTION.ipynb - Colaboratory
So, our model predicted very well on the dataset with an accuracy about 86%. If we fine tune our model, our accuracy could increase.
https://colab.research.google.com/drive/1_17XjEbpMmjFruDPGXgjIHmMVyAx5q2S#scrollTo=3XbLc_sCVpep&printMode=true 7/7