
Toxic Comment Classification of Roman Urdu Text

Presented By:
Waheed Abbas
FA17-RCS-013

Supervisor:
Dr. Rao Muhammad Adeel Nawab

Co-Supervisor:
Mr. Muhammad Sharjeel
Agenda

- Introduction
- Research Focus
- Literature Review
- Problem Statement
- Research Methodology
- Estimated Timetable

Task Description

INPUT: Roman Urdu text of varying length

OUTPUT: Classification of the text as a label (Toxic = 1, Non-Toxic = 0)

GOAL: A supervised learning task: learn from the input to predict the output

| Comment | Label |
| Tm pagal insan ho | Toxic |
| Aj aap bohat ache lag rahe ho | Non-Toxic |
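
As a rough illustration of the supervised setup described above, the sketch below trains a tiny bag-of-words Naive Bayes classifier on four comments taken from these slides. Naive Bayes is only a stand-in baseline chosen for brevity, not the method this proposal commits to, and the names `tokenize`, `train_naive_bayes`, `predict` and `EXAMPLES` are hypothetical.

```python
import math
import string
from collections import Counter

TOXIC, NON_TOXIC = 1, 0

# Labeled (text, label) pairs taken from the example slides.
EXAMPLES = [
    ("Tm pagal insan ho", TOXIC),
    ("Barwaaay blackmailer bohat bari film hay tu", TOXIC),
    ("Aj aap bohat ache lag rahe ho", NON_TOXIC),
    ("Zabardast yar matlab kamal kr diya.", NON_TOXIC),
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation.

    Assumption: real Roman Urdu text would need stronger normalization.
    """
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def train_naive_bayes(examples):
    """Count words per class and documents per class."""
    word_counts = {TOXIC: Counter(), NON_TOXIC: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[TOXIC]) | set(word_counts[NON_TOXIC])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (TOXIC, NON_TOXIC):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(EXAMPLES)
print(predict(model, "tm pagal ho"))
```

With only four training comments this says nothing about real accuracy; it only shows the input (text) to output (0 or 1) mapping that the task requires.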


Introduction

What Is a Toxic Comment?

"A rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion."
- perspectiveapi.com

Why Is Toxic Comment Classification Important?

Technology has given us many platforms (social media and discussion groups) where we not only share our personal lives but also take part in discussions and seek opinions on many matters.

Unfortunately, harassment is common in these online communities:

- 66% have seen harassment online
- 41% have personally experienced it

Source: Pew Research Center report, 2017

Examples of Toxic Comments

A few examples of toxic comments are given below:

1. Lanat ha tari shakal per kamini orat.
2. Dunyia me bohat kamina chawal insan dakhy magar yar tumhrai bat hi kuch or ha lanti...
3. Yeh MQM waley kuttay ki dum ki tarah hain. Yeh kabi theak nai ho saktay.
4. Botni k mar q nai jaty ye roz koi na koi drama ho raha hota ha... kon c manhoos ghaere te jb to PTI main shamil hua tha.
5. Barwaaay blackmailer bohat bari film hay tu…

Examples of Non-Toxic Comments

A few examples of non-toxic comments are given below:

1. Pak army Pak ISI zindabad. Hamin apni ISI or army pr Allah k bad bharosa hy Inshallah ab wohe ho ga jo Pak chahy ga.
2. Pagal kar doge hume kisi din tum Hansa Hansa ke yaar...
3. Zabardast yar matlab kamal kr diya.
4. Hum gahreeb ek kanal nhi le sakhte. Lekin phr b Allah ka shukr ha jo chota sa ghar dia ha ye bht ha.
5. Buht umda bhai ma Pakistan se ho tmhari video regularly Dekhta hu. Is waqar zaka ko jitna b roost kro ye nhi badlta esa hi ha.

Applications of Toxic Comment Classification

- Non-toxic communication in online communities
- Non-toxic chat applications
- Sentiment analysis

Research Focus

Toxic comment classification of Roman Urdu text



Literature Review
| Topic | Year | Dataset | Technique | Evaluation |
| Convolutional Neural Networks for Toxic Comment Classification | 2018 | TCC Competition Kaggle dataset | CNN, kNN, LDA, NB, SVM | Accuracy: CNN 0.912, kNN 0.697, LDA 0.808, NB 0.719, SVM 0.811 |
| Challenges for Toxic Comment Classification: An In-Depth Error Analysis | 2018 | TCC Competition Kaggle dataset | CNN, LSTM, Bidirectional GRU | Accuracy: CNN 0.981, LSTM 0.980, GRU 0.983, LR 0.975 |
| Predictive Embeddings for Hate Speech Detection on Twitter | 2018 | Twitter HATE dataset | LR, GRU, TWEM (proposed) | F1 score: LR 0.85, GRU 0.89, TWEM 0.92 |
| Ex machina: Personal attacks seen at scale | 2017 | 100K high-quality human-annotated dataset | LR (char n-gram, word n-gram), MLP (char n-gram, word n-gram) | Accuracy: LR Word 94.6%, LR Char 96.1%, MLP Word 95.2%, MLP Char 95.9% |
| Using convolutional neural networks to classify hate-speech | 2017 | 6,655 Twitter tweets (Racism 91, Sexism 946, Both 18, Non-hate 5,600) | CNN, LR (n-gram) | F1 score: CNN 0.782, LR 0.738 |
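
Several of the baselines reviewed above pair a linear model with word or character n-gram features. The sketch below shows one plausible way to extract such counts in plain Python; the function names `char_ngrams` and `word_ngrams` are hypothetical, and the lowercasing and whitespace handling are assumptions rather than the preprocessing used in the cited papers.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and
    collapsing runs of whitespace into single spaces."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_ngrams(text, n=1):
    """Count overlapping word n-grams over whitespace-separated tokens."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


print(char_ngrams("bohat", n=3))
print(word_ngrams("Tm pagal insan ho", n=2))
```

Character n-grams operate below the word level, which may matter for Roman Urdu because the same word is spelled in several ways in the example comments ("bohat", "bht", "Buht"). Whether they actually outperform word n-grams on this task would have to be tested on the corpus.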

Limitations of Previous Work

- Focus is on English and other European languages
- Lack of benchmark corpora for Roman Urdu, which are needed to develop and evaluate toxic comment classification methods
- Lack of standard techniques for Roman Urdu toxic comment classification

Problem Statement

Develop standard evaluation resources and methods for Roman Urdu toxic comment classification.

Research Methodology

Start

1. Collection of raw data from different domains, e.g. Facebook, Twitter and YouTube
2. Extraction of potential data (Roman Urdu comments)
3. Make annotation guidelines
4. Annotate data and generate a standardized corpus
5. Apply state-of-the-art techniques on the corpus, e.g. Neural Networks, Ensemble Methods
6. Result & Analysis: Precision, Recall, F1-Measure

End
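
The evaluation measures named in the methodology (precision, recall, F1-measure) can be computed directly from gold and predicted labels. The sketch below assumes the toxic class (label 1) is treated as the positive class; the function name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Fall back to 0.0 when a measure is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels: 1 = Toxic, 0 = Non-Toxic.
gold = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(gold, predicted))
```

In the toy call there are two true positives, one false positive and one false negative, so all three measures come out to 2/3.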

Dataset

A high-quality annotated dataset of at least 10K toxic and non-toxic comments in Roman Urdu.

Estimated Timetable

Period: SEP 2018 to APR 2019

Tasks, in planned order:

1. Comprehensive Literature Review
2. Writing and Submission of Synopsis
3. Raw Data Collection
4. Selection of Potential Data (Roman Urdu)
5. Making Annotation Guidelines
6. Annotate Corpus
7. Standardize the Corpus (Preprocessing)
8. Implementation of Technique
9. Result & Analysis
10. Thesis Writeup
11. Final Submission

Bibliography

H. Hosseini, S. Kannan, B. Zhang, and R. Poovendran, “Deceiving Google’s Perspective API Built for Detecting Toxic Comments,” arXiv preprint arXiv:1702.08138, 2017.
Z. Waseem and D. Hovy, “Hateful symbols or hateful people? predictive features for hate speech detection on twitter,” in Proceedings of the NAACL student research workshop, 2016, pp. 88–93.
M. Duggan, “Online Harassment,” Pew Research Center, 2014.
E. Wulczyn, N. Thain, and L. Dixon, “Ex machina: Personal attacks seen at scale,” in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1391–1399.
P. Badjatiya, S. Gupta, M. Gupta, and V. Varma, “Deep learning for hate speech detection in tweets,” in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 759–760.
H. Zhong et al., “Content-Driven Detection of Cyberbullying on the Instagram Social Network,” in IJCAI, 2016, pp. 3952–3958.
J. H. Park and P. Fung, “One-step and two-step classification for abusive language detection on twitter,” arXiv preprint arXiv:1706.01206, 2017.
Y. Chen, Y. Zhou, S. Zhu, and H. Xu, “Detecting offensive language in social media to protect adolescent online safety,” in Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), 2012, pp. 71–80.
S. V. Georgakopoulos, S. K. Tasoulis, A. G. Vrahatis, and V. P. Plagianakos, “Convolutional Neural Networks for Toxic Comment Classification,” arXiv preprint arXiv:1802.09957, 2018.
T. Davidson, D. Warmsley, M. Macy, and I. Weber, “Automated hate speech detection and the problem of offensive language,” arXiv preprint arXiv:1703.04009, 2017.
R. Kshirsagar, T. Cukuvac, K. McKeown, and S. McGregor, “Predictive Embeddings for Hate Speech Detection on Twitter,” arXiv preprint arXiv:1809.10644, 2018.
J. Golbeck et al., “A large labeled corpus for online harassment research,” in Proceedings of the 2017 ACM on Web Science Conference, 2017, pp. 229–233.
B. Gambäck and U. K. Sikdar, “Using convolutional neural networks to classify hate-speech,” in Proceedings of the First Workshop on Abusive Language Online, 2017, pp. 85–90.
Z. Waseem, “Are you a racist or am i seeing things? annotator influence on hate speech detection on twitter,” in Proceedings of the first workshop on NLP and computational social science, 2016, pp. 138–142.
B. van Aken, J. Risch, R. Krestel, and A. Löser, “Challenges for Toxic Comment Classification: An In-Depth Error Analysis,” arXiv preprint arXiv:1809.07572, 2018.
Thank You
Email: waheed0332@gmail.com
Contact: +923157022503
