COMPUTER SCIENCE AND ENGINEERING

Hate Speech Detection of Cyberbully using Natural Language Processing

Presented by:
• 110118104052 – Seyad Mohammed Faheem
• 110118104053 – Shaik Zahid
• 110118104055 – Shuaib
• 110118104062 – Umar Ali Ahamed K M
Mentored by:
Asst. Prof. Pasupathi
AGENDA
• Base Paper Details
• Abstract
• Requirements
• Introduction
• Domain
• Problem Statement
• Proposed System
• Steps and Functions
• Flow Chart
• Modules
• Data Collection
• Data preprocessing
• Model Selection
• Conclusion

BASE PAPER DETAILS
Paper name : Detection of Cyberbullying on Social Media Using Machine Learning
Authors : Varun Jain, Vishant Kumar, Vivek Pal, Dinesh Kumar Vishwakarma
Published in : 5th International Conference on Computing Methodologies and Communication (ICCMC)
DOI : 10.1109/ICCMC51019.2021.9418254
Publication Year : 2021
Base Paper URL : https://ieeexplore.ieee.org/document/9418254

INTRODUCTION

• Cyberbullying is bullying with the use of digital technologies. It can take place on social media,
messaging platforms, gaming platforms and mobile phones. It is repeated behavior, aimed at
scaring, angering or shaming those who are targeted. Detection of this kind of behavior online
helps prevent it from affecting victims.
• Domain:
• Machine Learning
• Natural Language Processing
• Problem Statement:
• To develop a machine learning model that classifies a tweet as cyberbullying or not,
using Natural Language Processing and Machine Learning techniques.

ABSTRACT
• Cyberbullying frequently leads to serious mental and physical distress, particularly for women and
children, and sometimes even drives them to attempt suicide. Online harassment attracts attention due to
its strong negative social impact. Many incidents of online harassment have recently occurred worldwide,
such as the sharing of private chats, rumors, and sexual remarks. Therefore, the identification of bullying
text or messages on social media has gained a growing amount of attention among researchers. This system
is powered by a machine learning model that predicts and classifies text containing hate speech or
offensive language, which leads to cyberbullying. This project focuses on the development of a system for
automatic detection of cyberbullying using Natural Language Processing, considering the main
characteristics of cyberbullying, such as the intention to harm an individual through abusive language
or hate speech.
LITERATURE SURVEY

TITLE | AUTHOR | TECHNIQUES | MERITS | DEMERITS
Collaborative detection of cyberbullying behavior in Twitter data | A. Mangaonkar, A. Hayrapetian, and R. Raje | Logistic regression, SVM, Naïve Bayes | More efficiency | Low accuracy
Automatic detection of cyberbullying on social networks based on bullying features | R. Zhao, A. Zhou, and K. Mao | SVM classifier with bullying features | More reliability | Inefficiency
From risk factors to detection and intervention: a practical proposal for future work on cyberbullying | A. Ioannou, 2018 | Neural network | High accuracy | High cost
Optimal online cyberbullying detection | D. S. Zois, 2018 | ANN | High accuracy | Takes more time

Requirements
• Hardware Requirements:
• Processor : Intel i3 or above
• RAM : 4 GB or above
• Storage : 250 GB or above
• Software Requirements:
• Python 3.7
• VS Code
• Jupyter Notebook
• Python packages:
a) NumPy
b) Pandas
c) Scikit-learn
d) NLTK

EXISTING SYSTEM:

• Researchers all over the world regard this problem as a serious danger on social media: as the
number of social media users increases rapidly, the problem grows with it. They have therefore
proposed various solutions. Some treated it as a simple text classification problem, using a
Naïve Bayes classifier with Bag-of-Words.
• This solution is not effective, as it achieved only 60% to 70% accuracy, which is not enough.
• Some machine learning and deep learning experts achieved much better accuracy, above 80%, using
Convolutional Neural Networks and Artificial Neural Networks. However, developing and deploying
these solutions is quite expensive because they require more computing power, which restricts
further upgrades of the model. This matters because the system has to work with large text data
and needs to be periodically updated and improved.
EXISTING SYSTEM (Contd.):

Drawbacks:
• The systems are expensive to train.
• The accuracy of the systems is around 60% to 80%, which is not enough to solve the problem.
• Upgrading these systems in future will require high computation, which prevents
periodic improvements.
PROPOSED SYSTEM
• The system proposed in this project is designed to be more efficient than the existing
systems. This is achieved with a TF-IDF vectorizer, which vectorizes the text based on the
frequency of each term in a document and the inverse of the number of documents that
contain the term. Instead of working with entire documents of text, this lets us
distinguish important terms from unimportant ones. Used this way it serves as feature
selection and dimensionality reduction, and helps machine learning algorithms train
efficiently.
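The TF-IDF weighting described above can be illustrated with a minimal sketch in plain Python. It assumes the textbook form tf × log(N / df) and whitespace tokenization; the function name `tf_idf` and the sample sentences are illustrative only. scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Illustrative documents (not taken from the project's data set).
docs = ["you are kind", "you are rude", "you are very rude"]
scores = tf_idf(docs)
```

Terms that occur in every document ("you", "are") receive a weight of zero, while rarer terms such as "kind" receive the largest weights, which is how the weighting separates informative terms from uninformative ones.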
Advantages:
• High Accuracy of the System can be achieved.
• System only requires low computing power
• System allows periodic improvements in future.

SYSTEM ARCHITECTURE
Architecture Diagram
[Diagram blocks: Data collection (DB), Data preprocessing, Split train/test dataset, NLP process,
Machine learning based, Evaluate metrics, Cyberbullying detection]

MODULES

• Data Collection.
• Source
• Dataset Description
• Natural Language Processing.
• Data Cleaning
• Text Data Vectorization
• Model Selection.
• Various Models
• Models Accuracy
• Final Model

DATA COLLECTION

• SOURCE:
• The data set for our project was collected from the Kaggle website.
• DATASET DESCRIPTION:
• The data set is in CSV (Comma Separated Values) format, with Tweet Text and Class
as fields and thousands of records.
• Tweet Text contains raw tweets, and each tweet is labelled in the Class field as
Hate Speech, Offensive Language or neither of them.
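As a rough sketch of reading such a file, the snippet below parses a small in-memory CSV with Python's standard `csv` module. The column names `tweet` and `class`, the label strings and the sample rows are assumptions made for illustration; the actual Kaggle file may use different headers or numeric label codes.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; the real file, its column names and label codes may differ.
SAMPLE_CSV = """tweet,class
"have a great day everyone",neither
"you are an idiot",offensive language
"nobody likes you, loser",offensive language
"""

def load_tweets(handle):
    """Read tweet texts and class labels from an open CSV file handle."""
    reader = csv.DictReader(handle)
    texts, labels = [], []
    for row in reader:
        texts.append(row["tweet"])
        labels.append(row["class"])
    return texts, labels

texts, labels = load_tweets(io.StringIO(SAMPLE_CSV))
label_counts = Counter(labels)  # quick check of the class balance
```

Counting the labels right after loading is a cheap way to see how unbalanced the classes are before any model is trained.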

DATA COLLECTION (contd.)
Data Set Sample
NATURAL LANGUAGE PROCESSING
• Data Cleaning:
• Simple text cleaning processes
• Stop-word removal
• Stemming
• Lemmatization

• Text Vectorization:
• TF-IDF – Term Frequency – Inverse Document Frequency
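The data cleaning steps listed above could look roughly like the following sketch, which uses only the standard `re` module. The function name `clean_tweet` and the tiny stop-word list are assumptions for illustration. Stemming and lemmatization are left out here because they would normally come from NLTK.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}

def clean_tweet(text, remove_stop_words=False):
    """Lowercase a tweet and strip URLs, retweet markers, mentions and non-letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"\brt\b", " ", text)        # drop the retweet marker
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_tweet("!!! RT @user123: Check THIS out... http://example.com 1st place!!")
```

Because digits and punctuation are replaced by spaces, a token such as "1st" is reduced to "st" and an HTML entity such as "&amp;" leaves the token "amp" behind.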

NATURAL LANGUAGE PROCESSING
RAW TEXT: !!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. & as a man you should always take the trash out...
PROCESSED TEXT: a a woman you shouldn t complain about clean up your hous amp a a man you should alway take the trash out

RAW TEXT: !!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
PROCESSED TEXT: boy dat cold tyga dwn bad for cuffin dat hoe in the st place

RAW TEXT: " & you might not get ya bitch back & thats that "
PROCESSED TEXT: amp you might not get ya bitch back amp that that

MODEL SELECTION
Model selection is the process in which, instead of relying on a single algorithm to train
our model, we train multiple models and select the one that is most suitable in terms of
accuracy.
• Various Machine Learning Algorithms:
• Naïve Bayes
• Support Vector Machine
• Stochastic Gradient Descent Classifier
• Random Forest Classifier
• Logistic Regression
• CatBoost
• LightGBM
• The dataset is split into train and test sets. Each of the algorithms listed above is then
trained on the train set to produce a model.
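The split-then-train loop described above might be sketched as follows with scikit-learn. The toy sentences, the binary labels and the hyperparameters are assumptions chosen so the example runs quickly; they are not the project's data or settings. CatBoost and LightGBM are omitted because they live in separate packages.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (1 = bullying, 0 = not); the real project uses the Kaggle tweets.
texts = [
    "you are so stupid and ugly", "nobody likes you loser",
    "shut up you idiot", "you are a worthless idiot",
    "go away stupid loser", "everyone hates you ugly idiot",
    "have a wonderful day friend", "thanks for the lovely gift",
    "great game last night", "see you at lunch tomorrow",
    "congratulations on the new job", "what a lovely sunny day",
]
labels = [1] * 6 + [0] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SGD": SGDClassifier(random_state=42),
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, classifier in candidates.items():
    # The pipeline fits TF-IDF on the train split only, then trains the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
    scores[name] = f1_score(y_test, predictions, average="weighted", zero_division=0)

best_name = max(scores, key=scores.get)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the test tweets out of the TF-IDF vocabulary, so the reported scores are not inflated by information from the test split.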
MODEL SELECTION (Contd.)

• Models Accuracy:
• Model quality can be measured using various metrics. We used the F-score to measure
the accuracy of our models. The standard F1-score is the harmonic mean of precision
and recall. A perfect model has an F-score of 1.

• Final Model
• Finally, the model with the highest score is selected for our prediction system.
This model is dumped as a pickle file for later use in our system, which predicts
the class label of the input text.
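Dumping and reloading a model with pickle can be sketched as below. A plain dictionary stands in for the trained model, and the file name `model.pkl` is an assumption. A fitted scikit-learn or LightGBM estimator would be passed to `pickle.dump` in the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable object is handled the same way.
model = {
    "name": "demo-classifier",
    "labels": ["hate speech", "offensive language", "neither"],
}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save the model to disk.
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Load it back, as the prediction system would do at start-up.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.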

MODEL SELECTION (Contd.)

[Figures: Accuracy Score and F1-Score charts]

MODEL SELECTION (Contd.)

S.NO | ML ALGORITHM | F1-SCORE (%)
1 | Naïve Bayes | 88
2 | Logistic Regression | 93
3 | Stochastic Gradient Descent | 94
4 | Support Vector Machine | 93
5 | Random Forest | 94
6 | CatBoost | 93
7 | LightGBM | 94
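To make the F1-score concrete, here is a small worked sketch that computes it as the harmonic mean of precision and recall from raw counts. The function name and the counts (90 true positives, 10 false positives, 5 false negatives) are hypothetical and not taken from the project's results.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision = 0.90, recall = 90/95.
score = f1_from_counts(tp=90, fp=10, fn=5)
```

Multiplying the result by 100 gives a percentage comparable to the values reported for each algorithm. With three classes, a per-class F1 is usually averaged, and the slides do not state which averaging was used.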

MODEL SELECTION (Contd.)

• Comparing the Accuracy Score and F1-Score of each algorithm we tried, we can conclude that
the models trained using the Random Forest Classifier, the Stochastic Gradient Descent
Classifier and LightGBM score highest, each with an F1-Score of 94%.
• So, we can choose any of those models for our predictive system. For this
particular dataset, we have decided to go with the LightGBM Classifier, as it is
simple to train.
• We can save this model by pickling it for later use in our system.

Final Result

• The model has been trained and selected, so we can now deploy it in our application.
We have designed a web page to illustrate the application of our model.
• If we provide this website with text data, it can detect whether the text contains
any form of hate speech.
Final Result (Contd.)
CONCLUSION
The system for detection of cyberbullying is much needed in this social media era, where
people have to go through numerous comments and criticisms that may harm their
mental health, leading to social anxiety and low self-esteem. This system can be further
improved with a larger dataset from social media and can be integrated into social
platforms to flag and prevent posts containing hate and negativity.
Future Expansion:
• The system proposed in this project can be further improved in accuracy and features by
training on more data. In general, a larger dataset yields a better model.
• In this project, the model is only capable of processing English text. With additional
datasets and linguistic studies, it can be extended to multilingual support so that it can
process text in other languages.
• This system can be deployed in real-time applications such as online public forums and
social media platforms to prevent cyberbullying through hate speech.

REFERENCES
[1] A. Mangaonkar, A. Hayrapetian, and R. Raje, “Collaborative detection of
cyberbullying behavior in Twitter data,” 2015, doi: 10.1109/EIT.2015.7293405.
[2] R. Zhao, A. Zhou, and K. Mao, “Automatic detection of cyberbullying on social
networks based on bullying features,” 2016, doi: 10.1145/2833312.2849567.
[3] A. Ioannou, J. Blackburn, G. Stringhini, E. De Cristofaro, N. Kourtellis, and
M. Sirivianos, “From risk factors to detection and intervention: a practical proposal
for future work on cyberbullying,” Behaviour & Information Technology, 37:3, 258-266,
2018, doi: 10.1080/0144929X.2018.1432688.
[4] D. S. Zois, “Optimal online cyberbullying detection,” 2018, doi:
10.1080/018492X.2018.1433458.
