YouTube Spam Comment Detection

Presented By: Gunik Maliwal, 19223032
Presented To: Dr. Dibakar Saha
Introduction
YouTube is one of the most popular and widely used social media websites. Because of
this popularity, it has become a platform for spammers to distribute spam through
video comments. This is a concern because spam can lead to phishing attacks, which can
target any user who clicks a malicious link. Spam has characteristic features that can
be analyzed and detected through classification.
The Dataset
The dataset is straightforward: it contains roughly 2,000 comments from popular
YouTube videos. Each row holds a comment followed by a label, 1 for spam or 0 for
not spam.

The dataset can be downloaded from this website

http://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection

All the datasets that are being fetched in the code are sourced from my GitHub gist.
Data Cleaning
These columns have been dropped as they were not relevant:

'COMMENT_ID', 'AUTHOR', 'DATE'

Also, the content of each comment has been filtered so that only alphabetic characters
remain, removing special characters that could interfere with classification.
New Content

Columns after cleaning: Class and New Content
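
The cleaning step above could be sketched roughly as follows. This is only an illustration, not the project's actual code: it assumes pandas is used, that the raw text and label columns are named CONTENT and CLASS, and that the cleaned column is called NEW_CONTENT. The two sample rows are invented. Non-letters are replaced with a space (another assumption) so that neighbouring words do not run together.

```python
import pandas as pd

# Invented sample rows; the column layout is an assumption about the raw CSV.
df = pd.DataFrame({
    "COMMENT_ID": ["a1", "b2"],
    "AUTHOR": ["user1", "user2"],
    "DATE": ["2015-05-01", "2015-05-02"],
    "CONTENT": ["Check out my channel!!! http://x.co/123", "Love this song <3"],
    "CLASS": [1, 0],
})

# Drop the columns that carry no signal for classification.
df = df.drop(columns=["COMMENT_ID", "AUTHOR", "DATE"])

# Keep only alphabetic characters; every run of other characters becomes one space.
df["NEW_CONTENT"] = (
    df["CONTENT"].str.replace(r"[^A-Za-z]+", " ", regex=True).str.strip()
)

# Keep just the label and the cleaned text.
df = df[["CLASS", "NEW_CONTENT"]]
```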


Using Naive Bayes Classifier
Naive Bayes classifier for multinomial models.

(sklearn.naive_bayes.MultinomialNB)
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word
counts for text classification). The multinomial distribution normally requires integer feature counts.
Advantages:
Low computation cost.
It can effectively work with large datasets.
Easy to implement, and a fast and accurate method of prediction.
It performs well in text classification problems.

Passing a comment as a vector array to the model's predict function returns the
classification result.
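
A minimal sketch of training and prediction with sklearn.naive_bayes.MultinomialNB is shown below. It assumes word counts are produced with scikit-learn's CountVectorizer, which the slides do not confirm. The six training comments and the helper name predict_comment are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration (1 = spam, 0 = not spam).
comments = [
    "check out my channel and subscribe",
    "subscribe to my channel for free gifts",
    "free gifts click this link now",
    "i love this song so much",
    "this video is amazing great song",
    "best music video ever",
]
labels = [1, 1, 1, 0, 0, 0]

# Turn each comment into a vector of integer word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)

# Fit the multinomial Naive Bayes model on the count vectors.
model = MultinomialNB()
model.fit(X, labels)


def predict_comment(text):
    """Hypothetical helper: vectorize one comment and return 1 (spam) or 0."""
    vector = vectorizer.transform([text])
    return int(model.predict(vector)[0])
```

A new comment has to go through the same fitted vectorizer before it reaches predict, so that its word counts line up with the vocabulary seen during training.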
Deploying Web Application for our model using Python Flask
In the end, we want our model to be available to end users so that they can make
use of it. Model deployment is one of the last stages of any machine learning project.

Flask is a web application framework written in Python. It has multiple modules that
make it easier for a web developer to write applications without having to worry about
the details like protocol management, thread management, etc.

Flask gives us a variety of choices for developing web applications, along with the
tools and libraries needed to build them.
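
As a rough idea of what such a deployment might look like, here is a small Flask sketch. The /predict route, the JSON field names and the classify function are all assumptions made for illustration, not the project's actual application. The classify body is a placeholder rule so the example runs on its own. A real app would vectorize the comment and call the trained model's predict function there.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify(comment):
    """Hypothetical stand-in for the trained model: 1 = spam, 0 = not spam."""
    # Placeholder rule only; replace with the fitted vectorizer and model.
    return int("http" in comment.lower())


@app.route("/predict", methods=["POST"])
def predict():
    # Read the comment from a JSON body such as {"comment": "..."}.
    payload = request.get_json(silent=True) or {}
    comment = payload.get("comment", "")
    return jsonify({"comment": comment, "spam": classify(comment)})
```

If this were saved as app.py, it could be started locally with `flask --app app run` and queried by sending a POST request with a JSON body to /predict.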
Screenshots
Code Repo : https://github.com/GunikthegEEk/PR-Term-Paper
