Synopsis Waheed Abbas Fa17-Rcs-013

Toxic Comment Classification Of Roman Urdu Text
Presented By:
Waheed Abbas
FA17-RCS-013
Supervisor:
Dr. Rao Muhammad Adeel Nawab
Co-Supervisor:
Mr. Muhammad Sharjeel
Agenda
- Introduction
- Research Focus
- Literature Review
- Problem Statement
- Research Methodology
- Estimated Timetable
Task Description
INPUT: Roman Urdu text of varying length
OUTPUT: Classification label for the text (Toxic = 1, Non-Toxic = 0)
GOAL: Supervised learning task: learn from the input to predict the output
Examples:
Tm pagal insan ho -> Toxic
Aj aap bohat ache lag rahe ho -> Non-Toxic
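The task above can be sketched as a small supervised classifier. The following is only an illustrative baseline, not the method proposed in this synopsis: it assumes character trigram features and a multinomial Naive Bayes model with add-one smoothing, and the names `char_ngrams` and `NaiveBayesSketch` are hypothetical. The two training comments are the examples from the task description; a real model would need the full annotated corpus.

```python
import math
from collections import Counter, defaultdict

TOXIC, NON_TOXIC = 1, 0


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded comment."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesSketch:
    """Multinomial Naive Bayes over character n-grams (illustrative assumption)."""

    def fit(self, comments, labels):
        self.counts = defaultdict(Counter)   # n-gram counts per label
        self.docs = Counter(labels)          # number of comments per label
        for text, label in zip(comments, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = set()
        for counter in self.counts.values():
            self.vocab.update(counter)
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        scores = {}
        for label in self.docs:
            counter = self.counts[label]
            denominator = sum(counter.values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)  # class prior
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((counter[gram] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


# The two labelled examples from the task description.
comments = ["Tm pagal insan ho", "Aj aap bohat ache lag rahe ho"]
labels = [TOXIC, NON_TOXIC]
model = NaiveBayesSketch().fit(comments, labels)
print(model.predict("Tm pagal insan ho"))  # prints 1 (Toxic)
```

Character n-grams are used here because Roman Urdu has no standard spelling, so the same word may appear in several variants; this is a design assumption for the sketch only.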

Introduction
What Is a Toxic Comment?
A rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion.
- perspectiveapi.com
Why Is Toxic Comment Classification Important?
We live in an era of technology that gives us many platforms (social media and discussion groups) where we not only share our personal lives but also engage in discussion and seek opinions on many matters.
Unfortunately, harassment is a common occurrence in these online communities:
- 66% have seen harassment online
- 41% have personally experienced it
(Pew Research Center report, 2017)
Examples Of Toxic Comments
A few examples of toxic comments are given below:
1 Lanat ha tari shakal per kamini orat.
2 Dunyia me bohat kamina chawal insan dakhy magar yar tumhrai bat hi kuch or ha lanti...
3 Yeh MQM waley kuttay ki dum ki tarah hain. Yeh kabi theak nai ho saktay.
4 Botni k mar q nai jaty ye roz koi na koi drama ho raha hota ha... kon c manhoos ghaere te jb to PTI main shamil hua tha.
5 Barwaaay blackmailer bohat bari film hay tu…

Examples Of Non-Toxic Comments
A few examples of non-toxic comments are given below:
1 Pak army Pak ISI zindabad. Hamin apni ISI or army pr Allah k bad bharosa hy Inshallah ab wohe ho ga jo Pak chahy ga.
2 Pagal kar doge hume kisi din tum Hansa Hansa ke yaar...
3 Zabardast yar matlab kamal kr diya.
4 Hum gahreeb ek kanal nhi le sakhte. Lekin phr b Allah ka shukr ha jo chota sa ghar dia ha ye bht ha.
5 Buht umda bhai ma Pakistan se ho tmhari video regularly Dekhta hu. Is waqar zaka ko jitna b roost kro ye nhi badlta esa hi ha.
Applications Of Toxic Comment Classification
- Non-toxic communication in online communities
- Non-toxic chat applications
- Sentiment analysis
Research Focus
Toxic comment classification of Roman Urdu text

Literature Review
Topic | Year | Dataset | Technique | Evaluation
Convolutional Neural Networks for Toxic Comment Classification | 2018 | TCC Competition Kaggle dataset | CNN, kNN, LDA, NB, SVM | Accuracy: CNN 0.912, kNN 0.697, LDA 0.808, NB 0.719, SVM 0.811
Challenges for Toxic Comment Classification: An In-Depth Error Analysis | 2018 | TCC Competition Kaggle dataset | CNN, LSTM, Bidirectional GRU | Accuracy: CNN 0.981, LSTM 0.980, GRU 0.983, LR 0.975
Predictive Embeddings for Hate Speech Detection on Twitter | 2018 | Twitter HATE dataset | LR, GRU, TWEM (Proposed) | F1 Score: LR 0.85, GRU 0.89, TWEM 0.92
Literature Review (continued)
Topic | Year | Dataset | Technique | Evaluation
Ex machina: Personal attacks seen at scale | 2017 | 100K high-quality human-annotated dataset | LR (n-gram, word-gram), MLP (n-gram, word-gram) | Accuracy: LR Word 94.6 %, LR Char 96.1 %, MLP Word 95.2 %, MLP Char 95.9 %
Using convolutional neural networks to classify hate-speech | 2017 | Twitter tweets totalling 6655 (Racism 91, Sexism 946, Both 18, Non hate 5600) | CNN, LR (n-gram) | F1 Score: CNN 0.782, LR 0.738
Limitations Of Previous Work
- Focus is on English and other European languages
- Lack of benchmark corpora for Roman Urdu to develop and evaluate toxic comment classification methods
- Lack of standard techniques for Roman Urdu toxic comment classification
Problem Statement
Develop standard evaluation resources and methods for Roman Urdu toxic comment classification.
Research Methodology
Start
1 Collection of raw data from different domains, e.g. Facebook, Twitter and YouTube
2 Extraction of potential data: Roman Urdu comments
3 Make annotation guidelines
4 Annotate data and generate standardized corpus
5 Apply state-of-the-art techniques on corpus, e.g. Neural Networks, Ensemble Method
6 Result & Analysis: Precision, Recall, F1-Measure
End
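The Result & Analysis step names precision, recall and F1-measure. A minimal sketch of how these could be computed for the Toxic class is given here; the function name `precision_recall_f1` and the gold and predicted labels are hypothetical and are not results from this work.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold labels and predictions for five comments (1 = Toxic, 0 = Non-Toxic).
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.667 recall=0.667 f1=0.667
```

Reporting these per class, rather than accuracy alone, matters when toxic comments are a minority of the corpus.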
Dataset
A minimum of 10K high-quality annotated toxic and non-toxic comments in Roman Urdu.
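One possible way to store and split such a corpus is sketched here. The CSV layout (columns `comment` and `label`), the helper names `read_corpus` and `stratified_split`, and the 80/20 split ratio are all assumptions made for illustration; the synopsis does not specify a file format or a split.

```python
import csv
import io
import random
from collections import defaultdict


def read_corpus(handle):
    """Read (comment, label) pairs from a CSV source with 'comment' and 'label' columns."""
    return [(row["comment"], int(row["label"])) for row in csv.DictReader(handle)]


def stratified_split(records, test_ratio=0.2, seed=13):
    """Split records into train/test parts while keeping the class ratio in both."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)
    rng = random.Random(seed)
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = int(round(len(items) * test_ratio))
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test


# Hypothetical two-row file built from the examples in the task description.
sample = io.StringIO(
    "comment,label\n"
    "Tm pagal insan ho,1\n"
    "Aj aap bohat ache lag rahe ho,0\n"
)
records = read_corpus(sample)
print(records)
```

Splitting per label keeps the proportion of toxic and non-toxic comments similar in the training and test parts, which makes the reported scores easier to compare.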
Estimated Timetable
Months: SEP 2018, OCT 2018, NOV 2018, DEC 2018, JAN 2019, FEB 2019, MAR 2019, APR 2019
Tasks:
- Comprehensive Literature Review
- Writing and Submission of Synopsis
- Raw Data Collection
- Selection of Potential Data (Roman Urdu)
- Making Annotation Guidelines
- Annotate Corpus
- Standardize the Corpus (Preprocessing)
- Implementation of Technique
- Result & Analysis
- Thesis Writeup
- Final Submission
Bibliography
H. Hosseini, S. Kannan, B. Zhang, and R. Poovendran, “Deceiving Google’s Perspective API Built for Detecting Toxic Comments,” arXiv preprint arXiv:1702.08138, 2017.
Z. Waseem and D. Hovy, “Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter,” in Proceedings of the NAACL Student Research Workshop, 2016, pp. 88–93.
M. Duggan, “Online Harassment,” Pew Research Center, 2014.
E. Wulczyn, N. Thain, and L. Dixon, “Ex machina: Personal attacks seen at scale,” in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1391–1399.
P. Badjatiya, S. Gupta, M. Gupta, and V. Varma, “Deep learning for hate speech detection in tweets,” in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 759–760.
H. Zhong et al., “Content-Driven Detection of Cyberbullying on the Instagram Social Network,” in IJCAI, 2016, pp. 3952–3958.
J. H. Park and P. Fung, “One-step and two-step classification for abusive language detection on Twitter,” arXiv preprint arXiv:1706.01206, 2017.
Y. Chen, Y. Zhou, S. Zhu, and H. Xu, “Detecting offensive language in social media to protect adolescent online safety,” in Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), 2012, pp. 71–80.
S. V. Georgakopoulos, S. K. Tasoulis, A. G. Vrahatis, and V. P. Plagianakos, “Convolutional Neural Networks for Toxic Comment Classification,” arXiv preprint arXiv:1802.09957, 2018.
T. Davidson, D. Warmsley, M. Macy, and I. Weber, “Automated hate speech detection and the problem of offensive language,” arXiv preprint arXiv:1703.04009, 2017.
R. Kshirsagar, T. Cukuvac, K. McKeown, and S. McGregor, “Predictive Embeddings for Hate Speech Detection on Twitter,” arXiv preprint arXiv:1809.10644, 2018.
J. Golbeck et al., “A large labeled corpus for online harassment research,” in Proceedings of the 2017 ACM on Web Science Conference, 2017, pp. 229–233.
B. Gambäck and U. K. Sikdar, “Using convolutional neural networks to classify hate-speech,” in Proceedings of the First Workshop on Abusive Language Online, 2017, pp. 85–90.
Z. Waseem, “Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter,” in Proceedings of the First Workshop on NLP and Computational Social Science, 2016, pp. 138–142.
B. van Aken, J. Risch, R. Krestel, and A. Löser, “Challenges for Toxic Comment Classification: An In-Depth Error Analysis,” arXiv preprint arXiv:1809.07572, 2018.
Thank You
Email: waheed0332@gmail.com
Contact: +923157022503
