
RESEARCH PAPER
MD. MAHRUF HASAN BEG
ID: 011191203
ABSTRACT
The recent development of social media poses new challenges to the research community in analyzing online interactions between people. Social networking sites offer great opportunities for connecting with others, but also increase the vulnerability of young people to undesirable phenomena, such as cybervictimization. Recent research reports that on average 20% to 40% of all teenagers have been victimized online. In this paper we focus on cyberbullying as a particular form of cybervictimization. Apart from describing our dataset construction and annotations, we present proof-of-concept experiments on the automatic identification of cyberbullying events and fine-grained cyberbullying categories.

Keywords: cyberbullying prevention; text classification; dataset construction.


INTRODUCTION

The main objective of this research is to gain insight into the linguistic characteristics of cyberbullying by collecting and annotating an adequate dataset. This will allow us to explore text characteristics (or features) that are potentially useful in distinguishing between cyberbullying and non-cyberbullying content.
DATASET CONSTRUCTION AND ANNOTATION
Data Collection: We constructed a corpus by collecting data from the social networking site Ask.fm (http://ask.fm), by receiving donations, and by setting up simulation experiments with volunteer youngsters. In total, 91,370 Dutch posts were collected.
Ask.fm: A substantial part of our corpus was collected from the social networking site Ask.fm, where users can create profiles and ask and answer questions, with the option of doing so anonymously.
Donations: First, we launched a media campaign in which people were asked to donate evidence of personal cases of cyberbullying. This resulted in a rather small but highly topical set of messages, including Facebook hate pages, message board posts, and chat conversations.
Simulations: Second, a series of simulation experiments was set up in which volunteer teenagers were asked to participate in a cyberbullying simulation on a social network by means of a role-playing game.
DATA ANNOTATIONS
The annotation scheme describes two levels of annotation. First, the annotators were asked to indicate, at the post level, whether a post is part of a cyberbullying event. This was done by assigning a harmfulness score to the post on a three-point scale: 0 signifying that the post contains no indications of cyberbullying, 1 that the post contains indications of cyberbullying that are not severe, and 2 that the post contains serious indications of cyberbullying.
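
To make the scheme concrete, here is a minimal sketch of how an annotated post could be represented in Python; the record layout and field names are illustrative assumptions, not part of the published annotation scheme:

```python
from dataclasses import dataclass

# Harmfulness scale from the annotation scheme:
# 0 = no indications of cyberbullying
# 1 = indications of cyberbullying, but not severe
# 2 = serious indications of cyberbullying
HARMFULNESS_LABELS = {0: "none", 1: "mild", 2: "severe"}

@dataclass
class AnnotatedPost:
    text: str
    harmfulness: int  # 0, 1, or 2

    def is_cyberbullying_event(self) -> bool:
        # Any non-zero score marks the post as part of a cyberbullying event.
        return self.harmfulness > 0

post = AnnotatedPost(text="nobody likes you, just leave", harmfulness=2)
print(post.is_cyberbullying_event(), HARMFULNESS_LABELS[post.harmfulness])
```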
EXPERIMENTS
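
As a rough illustration of the kind of proof-of-concept setup the abstract describes for identifying cyberbullying events, here is a minimal sketch of a binary classifier using scikit-learn; the TF-IDF features, linear SVM, and toy data are assumptions, not the authors' actual pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for annotated posts; label 1 = part of a cyberbullying event.
posts = ["you are so stupid nobody wants you here",
         "thanks for the follow, love your photos!",
         "everyone hates you, just quit",
         "what time is the game tonight?"]
labels = [1, 0, 1, 0]

# Word-n-gram TF-IDF features feeding a linear SVM, a common text baseline.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(posts, labels)
print(clf.predict(["nobody likes you"]))  # expected: [1]
```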
CONCLUSIONS

In this paper, we constructed a Dutch dataset of social media messages containing cyberbullying and proposed and evaluated a methodology for adequate annotation of this data.
ABSTRACT
The scourge of cyberbullying has assumed alarming proportions, with an ever-increasing number of adolescents admitting to having dealt with it either as a victim or as a bystander. Anonymity and the lack of meaningful supervision in the electronic medium are two factors that have exacerbated this social menace. Comments or posts involving sensitive topics that are personal to an individual are more likely to be internalized by a victim, often resulting in tragic outcomes. We decompose the overall detection problem into the detection of sensitive topics, lending itself to text classification sub-problems. We experiment with a corpus of 4,500 YouTube comments, applying a range of binary and multiclass classifiers. We find that binary classifiers for individual labels outperform multiclass classifiers. Our findings show that the detection of textual cyberbullying can be tackled by building individual topic-sensitive classifiers.
INTRODUCTION
In this paper, we focus on the detection of textual cyberbullying, which is one of the main forms of cyberbullying. We use a corpus of comments from YouTube videos involving sensitive topics related to race & culture, sexuality, and intelligence, i.e., topics involving aspects that people cannot change about themselves and that hence become both personal and sensitive.

We decompose the detection of bullying into subproblems involving text classification. We perform two experiments: (a) training binary classifiers to ascertain whether an instance can be classified into a sensitive topic or not, and (b) training multiclass classifiers to classify an instance into one of a set of sensitive topics. Our findings show that individual classifiers that decide whether a given comment belongs to a specific label fare much better than multiclass classifiers involving a set of labels.
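
A minimal sketch of the two setups, assuming scikit-learn with TF-IDF features; the toy comments and the choice of logistic regression are illustrative, not the paper's exact configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = ["that's so gay, ugh", "go back to your country",
            "you're such an idiot", "your culture is a joke"]
topics = ["sexuality", "race_culture", "intelligence", "race_culture"]

# (a) One binary classifier per label: does the comment belong to this topic?
binary_clfs = {}
for topic in set(topics):
    y = [1 if t == topic else 0 for t in topics]
    binary_clfs[topic] = make_pipeline(
        TfidfVectorizer(), LogisticRegression()).fit(comments, y)

# (b) A single multiclass classifier choosing among all topics at once.
multiclass_clf = make_pipeline(
    TfidfVectorizer(), LogisticRegression()).fit(comments, topics)

print({t: clf.predict(["you're such an idiot"])[0]
       for t, clf in binary_clfs.items()})
print(multiclass_clf.predict(["you're such an idiot"]))
```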
CORPUS
Using the YouTube PHP API, we scraped roughly a thousand comments apiece from controversial videos surrounding sexuality, race & culture, and intelligence. We were constrained by a limitation imposed by YouTube: an upper limit of 1,000 comments can be downloaded per video. The total number of comments downloaded overall was greater than 50,000.
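
The PHP API used at the time has since been retired; as a rough modern equivalent, here is a sketch of paging through a video's comments with the YouTube Data API v3 in Python. The API key, video ID, and helper function are assumptions for illustration, not part of the paper's pipeline:

```python
from googleapiclient.discovery import build  # pip install google-api-python-client

def fetch_comments(api_key: str, video_id: str, limit: int = 1000) -> list[str]:
    """Page through a video's top-level comments, up to `limit`."""
    youtube = build("youtube", "v3", developerKey=api_key)
    comments, page_token = [], None
    while len(comments) < limit:
        params = {"part": "snippet", "videoId": video_id, "maxResults": 100}
        if page_token:
            params["pageToken"] = page_token
        resp = youtube.commentThreads().list(**params).execute()
        for item in resp["items"]:
            comments.append(
                item["snippet"]["topLevelComment"]["snippet"]["textOriginal"])
        page_token = resp.get("nextPageToken")
        if not page_token:  # no more pages for this video
            break
    return comments[:limit]
```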
DATA PREPROCESSING
Annotation: The comments downloaded from all the videos were arranged in a randomized order prior to annotation. Two annotators, one of whom was an educator who works with middle school children, annotated each comment along the lines of three labels, defined as follows:
Sexuality: Negative comments involving attacks on sexual minorities and sexist attacks on women.
Race and Culture: Attacks bordering on racial minorities (e.g., African-American, Hispanic, and Asian) and cultures (e.g., Jewish, Catholic, and Asian traditions), including unacceptable descriptions pertaining to race and stereotypical mocking of cultural traditions.
Intelligence: Comments attacking the intelligence and mental capacities of an individual.
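
With two annotators, the natural sanity check is inter-annotator agreement. A minimal sketch using Cohen's kappa from scikit-learn follows; the label vectors are toy stand-ins, not the paper's data:

```python
from sklearn.metrics import cohen_kappa_score

# One label per comment, one vector per annotator (toy examples).
annotator_1 = ["sexuality", "none", "race_culture", "intelligence", "none"]
annotator_2 = ["sexuality", "none", "race_culture", "none", "none"]

# Cohen's kappa corrects raw agreement for agreement expected by chance.
print(cohen_kappa_score(annotator_1, annotator_2))
```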
CONCLUSIONS

In this paper, we focus on the problem of detecting textual cyberbullying in stand-alone posts with a dataset of YouTube comments. We decompose the problem into the detection of topics that are of a sensitive and personal nature. The labels are of a personal nature, and instances that have a negative connotation and might include profanity are likely to be instances of cyberbullying.
PROPOSED METHODS

CNN: Inspired by Kim's work on using CNNs for hate speech detection. We use the same settings for the CNN as described there.

LSTM: Unlike feed-forward neural networks, recurrent neural networks such as LSTMs can use their internal memory to process arbitrary sequences of inputs. Hence, we use LSTMs to capture long-range dependencies in tweets, which may play a vital role in hate speech detection.
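
A minimal sketch of the two architectures in Keras, assuming a fixed vocabulary size and tokenized, padded tweets; the hyperparameters are placeholders, not the settings from the cited work:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM = 20000, 128

def build_cnn() -> models.Model:
    # Kim-style text CNN: embed tokens, convolve over n-gram windows, max-pool.
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.Conv1D(filters=100, kernel_size=3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(1, activation="sigmoid"),  # hate speech vs. not
    ])

def build_lstm() -> models.Model:
    # The LSTM reads the tweet token by token, carrying state across the
    # sequence, which is what lets it capture long-range dependencies.
    return models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])

for model in (build_cnn(), build_lstm()):
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
```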
