Corresponding Author:
Md. Nesarul Hoque
Department of Computer Science and Engineering, Faculty of Engineering, University of Chittagong
Chittagong-4331, Bangladesh
Email: mnhshisir@gmail.com
1. INTRODUCTION
Social communication platforms like Facebook, Instagram, YouTube, and others are now heavily used
for disseminating personal feelings and opinions. According to a report in 2021, about 2.9 billion users actively
interact with Facebook each month, and 2.4 billion and 1.4 billion people, respectively, prefer YouTube for
sharing videos and Instagram for photos. During this sharing, a person is often threatened or embarrassed by unpleasant messages sent through electronic technology. This is where the term cyberbullying arises, and it is growing throughout the world. Sometimes, it results in extreme emotional and physical torment and even self-destruction [1].
According to a NapoleonCat report, in 2021 almost 47 million Bangladeshi users interacted with Facebook, and about 68.8% of these users were male. As per a survey by the Global Digital Statshot, in terms of
the highest number of active users, Dhaka, the capital of Bangladesh, ranked second among all cities globally
in 2017. The Daily Star, a leading newspaper in this country, reported in 2020 that there is a high incidence of
cyberbullying in this region, with 80% of the victims between the ages of 14 and 22 being females.
Although cyberbullying is a rising issue in Bangladesh, few research articles have been published on detecting Bengali bully text. Identifying the cyberbullying class is not an easy task, since extracting the intention and the context of a particular text remains a challenging job for researchers. Furthermore, very few studies focus on a detailed analysis of the text pre-processing and feature selection levels across three families of machine learning (ML) models: classical ML, deep learning (DL), and transformer learning [2]. Moreover, most of these studies do not consider cross-validation techniques to develop a more generic detection model. In addition, we have found very little analysis of hyper-parameter tuning of the DL models. In this research, we address these issues in order to build a strong cyberbullying text detection system for the Bengali language.
Several pieces of research have been undertaken to investigate bullying materials in the Bengali language. Ahmed et al. [3] successfully identified bullying texts in social media. Here, the authors used a unique feature, the pointwise mutual information-semantic orientation (PMI-SO) score, alongside other features such as the number of likes and cyberbullying words. Then, they applied the extreme gradient boosting (XGBoost) classifier and obtained 93% accuracy. However, they did not apply cross-validation or hyper-parameter tuning, and they did not try DL or transformer-based models in this work. Ahmed et al. [4]
made three cyberbullying detection systems for three different types of Bengali datasets: single-coded Bengali, Romanized Bengali, and a mix of single-coded and Romanized Bengali. This is a unique contribution to cyberbullying detection. For single-coded Bengali, they proposed the convolutional neural network (CNN) classifier (with an accuracy of 84%); however, for the other two dataset types, multinomial Naive Bayes (MNB) showed superior performance: 84% for the Romanized Bengali and 80% for the mixed dataset.
They did not apply transformer-based pre-trained models and did not use cross-validation or hyper-parameter
optimization techniques in this research. Das et al. [5] worked on a hateful dataset and handled both binary
and multi-class classification problems. Here, they created the “Bangla Emot” module for processing emoji
and emoticons and provided an attention-based decoder model that demonstrated the best performance with an
accuracy of 88% in binary classification and 77% in multi-class classification. Cross-validation and hyper-parameter tuning were not taken into account. Romim et al. [6] conducted an experiment on hateful text. For extracting the features, they applied the fastText, BengFastText, and word2vec word embedding techniques. The authors proposed the support vector machine (SVM) model, which achieved the highest accuracy of 87.5% and the highest F1-score of 0.911. However, hyper-parameter tuning and cross-validation were not considered
in this work. Also, they did not experiment with transformer-based models to detect hateful text. Islam et al. [7] utilized an SVM model with the term frequency-inverse document frequency (TF-IDF) feature extraction technique and achieved 88% accuracy on an abusive text dataset. The authors did not perform cross-validation or hyper-parameter optimization. Moreover, this research did not consider DL or transformer learning methods to identify abusive text. Kumar et al. [8] employed two shared task datasets, “HASOC” [9] and “TRAC-2”
[10], to identify offensive and aggressive text in the Bengali, Hindi, and English languages. The principal success of this research is that the authors achieved the best result (0.93 F1-score) using a mix of character trigrams and word unigrams with an SVM classifier for Bengali text. Although they applied 5-fold cross-validation, hyper-parameter optimization was skipped. Chakraborty and Seddiqui [11] investigated abusive and threatening texts and developed a detection system using the linear SVM model with an accuracy of 78%, which is the main success of this study. The authors did not use the cross-validation technique for generalizing the model and did not tune the hyper-parameters during model execution. Furthermore, they did not choose transformer learning algorithms during model implementation.
Our goal in this study is to address every concern raised by the previous studies in this area. With this in mind, we conduct experiments at three levels: pre-processing tasks, feature selection tasks, and implementation of cutting-edge ML models. Ultimately, we contribute the following to the establishment of a highly effective cyberbullying detection system:
− Pre-processing level: incorporate additional pre-processing tasks, such as removing thin-space Unicode characters and replacing emoticons and emojis with Bengali words, alongside the general pre-processing tasks.
Detecting cyberbullying text using the approaches with ... (Md. Nesarul Hoque)
360 r ISSN: 2252-8938
− Feature selection level: features are extracted for the classical ML, DL, and transformer learning classifiers separately.
− Cross-validation technique: perform a 10-fold cross-validation approach for the classical ML models for a more reliable detection system.
− Hyper-parameter tuning: perform hyper-parameter tuning using the “Bayesian optimization” algorithm during the implementation of the DL models.
− Selection of best models: propose the best detection approach from each of the three types of ML models.
2. METHOD
To develop a strong detection system, we first collect the dataset. Then, we apply various text pre-processing tasks. After that, we extract features from the pre-processed data. Finally, we apply different types of classifier models with 10-fold cross-validation and hyper-parameter tuning using the Bayesian optimization technique. The overall working process is depicted in Figure 1. The following sub-sections explain all the processes step-by-step.
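As a rough sketch of the classical ML track of this pipeline, the snippet below combines word and character TF-IDF features with an MNB classifier under 10-fold cross-validation; the toy texts, labels, and parameter choices are illustrative assumptions, not the actual dataset or configuration.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Placeholder corpus; the real experiments use the Bengali dataset.
texts = ["tumi khub kharap", "darun video", "boka chele", "sundor chobi"] * 5
labels = [1, 0, 1, 0] * 5  # 1 = bully, 0 = non-bully (toy labels)

# Combined N-grams: word unigrams plus character trigrams.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 3))),
])
model = Pipeline([("features", features), ("clf", MultinomialNB())])

# 10-fold cross-validation, as used for the classical ML models.
scores = cross_val_score(model, texts, labels, cv=10)
print(len(scores))
```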
In the general pre-processing tasks, we remove links, URLs, special characters, punctuation, digits, and stop-words, and apply tokenization, lower-casing (for English words), and stemming [12] operations. On the other hand, the “APT” contains the following four pre-processing tasks:
− Removing the thin-space Unicode character: our dataset contains the “ZERO WIDTH NON-JOINER” (U+200C) Unicode character, which creates garbage tokens when we use regular expressions during tokenization. We have discarded this Unicode character from the dataset.
− Fixing space errors around delimiters and sentence-ending punctuation: in this dataset, we have found many space-related errors before and after delimiters such as the comma (“,”) and semi-colon (“;”), and sentence-ending punctuation such as the full-stop (“.” or “।”) and question mark (“?”). Correcting these errors reduces the number of unwanted tokens.
− Replacing emoticons (ASCII characters) with Bengali words: we have developed an emoticon dictionary that contains about 400 commonly used emoticons and their equivalent Bengali meanings. Here, we have grouped similar emoticons under the same Bengali word; e.g., the emoticons “:)”, “:=)”, and “:|)” all convey the same Bengali meaning, “খুশি” (smile). All emoticons in the dataset are substituted with their corresponding Bengali words. This replacement may help in better understanding the emotion [13] of a particular text.
− Replacing emoji (Unicode characters) with Bengali words: as with the emoticons, we have replaced every emoji with its equivalent Bengali word. Here, we also built an emoji dictionary of about 280 of the most used emoji with their Bengali meanings and grouped similar emoji under the same Bengali word.
To the best of our knowledge, the above four pre-processing tasks differ slightly from those of existing studies. We experiment with “APT” in section 3. Finally, we observe the impact of these pre-processing tasks on the detection system.
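A minimal sketch of the emoticon/emoji replacement and the U+200C removal described above; the dictionary entries and the helper name `replace_symbols` are illustrative assumptions (the actual dictionaries contain about 400 emoticons and 280 emoji):

```python
import re

# Illustrative entries only; the paper's dictionaries are far larger.
EMOTICON_MAP = {":)": "খুশি", ":=)": "খুশি", ":(": "দুঃখ"}
EMOJI_MAP = {"\U0001F600": "খুশি", "\U0001F622": "দুঃখ"}

def replace_symbols(text):
    # Substitute emoticons and emoji with their Bengali words.
    for emo, word in EMOTICON_MAP.items():
        text = text.replace(emo, " " + word + " ")
    for emoji, word in EMOJI_MAP.items():
        text = text.replace(emoji, " " + word + " ")
    # Discard the ZERO WIDTH NON-JOINER (U+200C) noted above.
    text = text.replace("\u200c", "")
    # Collapse the extra spaces introduced by the substitutions.
    return re.sub(r"\s+", " ", text).strip()

print(replace_symbols("ভালো :)"))
```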
MNB assigns a probability to each class for every data point and selects the class with the higher probability. The probability P(C|F) is calculated using Bayes' theorem as in (1).

P(C|F) = P(F|C) · P(C) / P(F)    (1)

where C is a possible class variable and F represents the unique feature.
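Equation (1) can be illustrated with made-up probabilities (the numbers below are assumptions for the sake of the arithmetic, not dataset statistics):

```python
# P(C|F) = P(F|C) * P(C) / P(F), equation (1), on toy values.
p_f_given_c = 0.30   # P(feature | bully class), assumed
p_c = 0.40           # P(bully class), assumed
p_f = 0.20           # P(feature), assumed
p_c_given_f = p_f_given_c * p_c / p_f
print(p_c_given_f)  # 0.6
```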
Random forest (RF): the RF consists of a number of decision trees which are computed separately [23]. Here, the information gain is obtained for each tree through the “Gini impurity” formula as follows:

Gini = 1 − Σ_{i=1}^{C} (P_i)²    (2)

where C is the total number of possible classes and P_i is the probability of the ith class.
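Equation (2) is straightforward to compute; a small helper (not from the paper) illustrates it on two toy class distributions:

```python
def gini(probs):
    # Gini = 1 - sum over the C classes of P_i squared, equation (2).
    return 1.0 - sum(p * p for p in probs)

print(gini([0.5, 0.5]))  # maximally impure two-class node
print(gini([1.0, 0.0]))  # pure node
```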
Logistic regression (LR): the working principle of this model is based on two functions, the sigmoid (3) and the cost (4), given below:

h_θ(x) = 1 / (1 + e^{−β^T x})    (3)

C(θ) = (1/k) Σ_{i=1}^{k} c(h_θ(x_i), y_i)    (4)

c(h_θ(x), y) = −log(1 − h_θ(x)) if y = 0, and −log(h_θ(x)) if y = 1

where β^T indicates the transpose of the regression coefficient matrix (β), k represents the total number of training observations, h_θ(x_i) is the hypothesis function of the ith training observation, and y_i represents the actual output of the ith training sample.
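Equations (3) and (4) can be checked numerically; the sketch below evaluates the sigmoid and the per-sample cost at one point (the input value is an arbitrary assumption):

```python
import math

def sigmoid(z):
    # h_theta(x) = 1 / (1 + exp(-beta^T x)), equation (3).
    return 1.0 / (1.0 + math.exp(-z))

def sample_cost(h, y):
    # Per-sample cost c(h_theta(x), y) from equation (4).
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

h = sigmoid(0.0)     # beta^T x = 0 gives the decision boundary
print(h)             # 0.5
print(sample_cost(h, 1))
```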
Long short-term memory (LSTM): this model was developed to handle the problem of long-term dependency in sequential and time-series data [24]. Here, four interacting units: the memory cell unit, forget gate unit, input gate unit, and output gate unit, work together in each LSTM cell. Information is added to or forgotten from the memory cell state. The forget gate unit decides which information from the previous cell state is discarded. In the input gate unit, new significant information is added into the memory cell state. Lastly, the output gate unit decides which information is passed on to the next LSTM cell state.
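The gate interplay can be sketched with a single scalar LSTM step; sharing one weight `w` across all gates is purely a simplification for illustration (real cells use separate learned weight matrices and biases):

```python
import math

def lstm_cell(x, h_prev, c_prev, w=0.5):
    # Scalar sketch of one LSTM step; all gates share weight w (assumption).
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    f = sig(w * (x + h_prev))          # forget gate
    i = sig(w * (x + h_prev))          # input gate
    g = math.tanh(w * (x + h_prev))    # candidate memory
    o = sig(w * (x + h_prev))          # output gate
    c = f * c_prev + i * g             # updated memory cell state
    h = o * math.tanh(c)               # hidden state passed to the next step
    return h, c

h, c = lstm_cell(1.0, 0.0, 0.0)
print(h, c)
```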
Bidirectional LSTM (BiLSTM): the LSTM model struggles to predict a word that depends more on future context words. To solve this issue, the BiLSTM was introduced, where one LSTM operates in the forward direction and another flows in the backward direction [25]. Here, the output of a particular BiLSTM cell state is computed by combining the outputs of both LSTM cells for the same time step.
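The combination step can be shown with toy values: concatenation of the forward and backward hidden states is one common combiner (the numbers below are assumptions, not learned activations):

```python
# Forward and backward LSTM outputs for the same time step (toy values).
h_forward = [0.2, 0.7]
h_backward = [0.5, 0.1]

# Concatenation doubles the hidden dimension of the BiLSTM output.
h_bi = h_forward + h_backward
print(h_bi)
```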
Convolutional neural network with bidirectional LSTM (CNN-BiLSTM): the CNN-BiLSTM is a hybrid model in which the CNN and BiLSTM models are executed sequentially. A CNN operates with three fundamental layers: convolution, pooling, and fully connected. In the convolution layer, an element-by-element matrix multiplication (the convolution operation) is performed between the input data and the kernel (feature detector), producing a feature map for each kernel. The dimension of this feature map is reduced in the pooling layer. Finally, the reduced feature map is passed through the fully connected layer, which behaves like a standard artificial neural network (ANN) architecture. In the CNN-BiLSTM model, the BiLSTM part is added just before the fully connected layer. For text processing, the CNN captures local dependencies among the contiguous words of a sentence by extracting character-level features, while the BiLSTM deals with the overall dependency of a particular feature across the entire text [26].
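The convolution and pooling layers can be illustrated with a 1-D toy example (the signal and kernel values are assumptions; real CNNs learn the kernel):

```python
# Toy 1-D "sentence" of feature values and a kernel of width 2.
signal = [1.0, 2.0, 0.5, 3.0, 1.5]
kernel = [0.5, -0.5]

# Convolution layer: slide the kernel and sum element-wise products.
feature_map = [sum(s * k for s, k in zip(signal[i:i + 2], kernel))
               for i in range(len(signal) - 1)]

# Pooling layer: max-pooling reduces the feature map to its peak response.
pooled = max(feature_map)
print(feature_map, pooled)
```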
Bidirectional encoder representations from transformers (BERT): BERT shows outstanding performance on context-specific tasks, including language inference and question answering, compared with other pre-trained models such as embeddings from language models (ELMo) and OpenAI generative pre-training (GPT). It works in two separate stages: a pre-training framework and a fine-tuning framework. In the pre-training framework, BERT obtains a contextual understanding of the specified languages using two unsupervised tasks: the masked language model (MLM) and next sentence prediction (NSP). In contrast, the fine-tuning framework is applied to several downstream natural language processing (NLP) tasks, such as sentiment classification, sentence prediction, and named entity recognition, by simple adjustment of the task-specific architecture [14].
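The MLM idea can be illustrated without any model: mask roughly 15% of the tokens and keep the originals as prediction targets. Plain whitespace tokenization and the toy sentence below are assumptions (BERT actually uses WordPiece subword tokens):

```python
import random

def mask_tokens(tokens, rate=0.15, seed=1):
    # Replace ~rate of tokens with [MASK]; remember the originals,
    # which the model must recover during pre-training.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = tok
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("se khub bhalo chele".split())
print(masked, targets)
```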
4. CONCLUSION
In this study, we have experimented with pre-processing tasks, feature selection tasks, and the selection of classifier models from three points of view: classical ML, DL, and transformer learning, and proposed a concise decision in every aspect. Here, the “APT” pre-processing tasks are introduced alongside the general pre-processing tasks. Then, we have shown that the combined N-grams (character and word level) perform well for the classical ML algorithms, while fastText gives better results than the other two embedding techniques, word2vec and GloVe, for the DL algorithms. Finally, using “APT”, BERT outperforms the others with an utmost accuracy of 80.165%. In future research, we will apply our proposed approaches to other related datasets. Furthermore, we will analyze other new pre-processing tasks and extract new features. Finally, we will experiment with various ML approaches (classical ML, DL, and transformer learning) separately and in combined form.
ACKNOWLEDGEMENTS
The Information and Communication Technology (ICT) Division (Government of the People's Republic of Bangladesh) provided a fellowship (code number 1280101-120008431-3821117) for the 2022-2023 fiscal year. This fellowship supported our entire research work.
REFERENCES
[1] D. L. Hoff and S. N. Mitchell, “Cyberbullying: causes, effects, and remedies,” Journal of Educational Administration, vol. 47, no. 5,
pp. 652–665, Aug. 2009, doi: 10.1108/09578230910981107.
[2] M. N. Hoque, P. Chakraborty, and M. H. Seddiqui, “The challenges and approaches during the detection of cyberbullying text for
low-resource language: A literature review,” ECTI Transactions on Computer and Information Technology (ECTI-CIT), vol. 17,
no. 2, pp. 192–214, Apr. 2023, doi: 10.37936/ecti-cit.2023172.248039.
[3] M. Ahmed, M. Rahman, S. Nur, A. Z. M. Islam, and D. Das, “Introduction of PMI-SO integrated with predictive and lexicon-based features to detect cyberbullying in Bangla text using machine learning,” in Proceedings of 2nd International Conference on Artificial Intelligence: Advances and Applications, 2022, pp. 685–697, doi: 10.1007/978-981-16-6332-1_56.
[4] M. T. Ahmed, M. Rahman, S. Nur, A. Islam, and D. Das, “Deployment of machine learning and deep learning algorithms
in detecting cyberbullying in Bangla and romanized bangla text: A comparative study,” in 2021 International Conference
on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), Feb. 2021, pp. 1–10, doi:
10.1109/ICAECT49130.2021.9392608.
[5] A. K. Das, A. A. Asif, A. Paul, and M. N. Hossain, “Bangla hate speech detection on social media using attention-based recurrent
neural network,” Journal of Intelligent Systems, vol. 30, no. 1, pp. 578–591, Apr. 2021, doi: 10.1515/jisys-2020-0060.
[6] N. Romim, M. Ahmed, H. Talukder, and M. S. Islam, “Hate speech detection in the bengali language: A dataset and its baseline
evaluation,” in Proceedings of International Joint Conference on Advances in Computational Intelligence, 2021, pp. 457–468, doi:
10.1007/978-981-16-0586-4_37.
[7] T. Islam, N. Ahmed, and S. Latif, “An evolutionary approach to comparative analysis of detecting Bangla abusive text,” Bulletin of
Electrical Engineering and Informatics, vol. 10, no. 4, pp. 2163–2169, Aug. 2021, doi: 10.11591/eei.v10i4.3107.
[8] R. Kumar, B. Lahiri, and A. K. Ojha, “Aggressive and offensive language identification in Hindi, Bangla, and English: A comparative
study,” SN Computer Science, vol. 2, no. 1, pp. 1–20, Feb. 2021, doi: 10.1007/s42979-020-00414-6.
[9] T. Mandl et al., “Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages,” in Proceedings of the 11th Forum for Information Retrieval Evaluation, Dec. 2019, doi: 10.1145/3368567.3368584.
[10] S. Bhattacharya et al., “Developing a multilingual annotated corpus of misogyny and aggression,” arXiv preprint arXiv:2003.07428,
2020, doi: 10.48550/arXiv.2003.07428.
[11] P. Chakraborty and M. H. Seddiqui, “Threat and abusive language detection on social media in Bengali language,” in 2019 1st
International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), May 2019, pp. 1–6, doi:
10.1109/ICASERT.2019.8934609.
[12] M. H. Seddiqui, A. A. M. Maruf, and A. N. Chy, “Recursive suffix stripping to augment Bangla stemmer,” in International Confer-
ence Advanced Information and Communication Technology (ICAICT), 2016.
[13] M. Rahman and M. Seddiqui, “Comparison of classical machine learning approaches on bangla textual emotion analysis,” arXiv
preprint arXiv:1907.07826, 2019, doi: 10.48550/arXiv.1907.07826.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018, doi: 10.48550/arXiv.1810.04805.
[15] S. Ahammed, M. Rahman, M. H. Niloy, and S. M. M. H. Chowdhury, “Implementation of machine learning to detect hate speech
in Bangla language,” in 2019 8th International Conference System Modeling and Advancement in Research Trends (SMART), Nov.
2019, pp. 317–320, doi: 10.1109/SMART46866.2019.9117214.
[16] M. Jahan, I. Ahamed, M. R. Bishwas, and S. Shatabda, “Abusive comments detection in Bangla-English code-mixed and translit-
erated text,” in 2019 2nd International Conference on Innovation in Engineering and Technology (ICIET), Dec. 2019, pp. 1–6, doi:
10.1109/ICIET48527.2019.9290630.
[17] O. Sharif and M. M. Hoque, “Automatic detection of suspicious bangla text using logistic regression,” in Intelligent Computing and
Optimization: Proceedings of the 2nd International Conference on Intelligent Computing and Optimization 2019 (ICO 2019), 2020,
pp. 581–590, doi: 10.1007/978-3-030-33585-4_57.
[18] E. A. Emon, S. Rahman, J. Banarjee, A. K. Das, and T. Mittra, “A deep learning approach to detect abusive Bengali text,” in
2019 7th International Conference on Smart Computing and Communications (ICSCC), Jun. 2019, pp. 1–5, doi: 10.1109/IC-
SCC.2019.8843606.
[19] M. M. Rayhan, T. A. Musabe, and M. A. Islam, “Multilabel emotion detection from bangla text using BiGRU and CNN-BiLSTM,”
in 2020 23rd International Conference on Computer and Information Technology (ICCIT), Dec. 2020, pp. 1–6, doi: 10.1109/IC-
CIT51783.2020.9392690.
[20] M. S. Jahan, M. Haque, N. Arhab, and M. Oussalah, “BanglaHateBERT: BERT for abusive language detection in Bengali,” in Proceedings of the Second International Workshop on Resources and Techniques for User Information in Abusive Language Analysis, Jun. 2022, pp. 8–15.
[21] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, Sep. 1995, doi:
10.1007/BF00994018.
[22] S. Xu, Y. Li, and Z. Wang, “Bayesian multinomial Naïve Bayes classifier to text classification,” in Advanced Multimedia and
Ubiquitous Engineering, Singapore: Springer, 2017, pp. 347–352, doi: 10.1007/978-981-10-5041-1_57.
[23] J. Ali, R. Khan, N. Ahmad, and I. Maqsood, “Random forests and decision trees,” International Journal of Computer Science Issues
(IJCSI), vol. 9, no. 5, pp. 272-278, Sep. 2012.
[24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi:
10.1162/neco.1997.9.8.1735.
[25] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE transactions on Signal Processing, vol. 45, no. 11,
pp. 2673–2681, 1997, doi: 10.1109/78.650093.
[26] M. R. Karim, B. R. Chakravarthi, J. P. McCrae, and M. Cochez, “Classification benchmarks for under-resourced Bengali language
based on multichannel convolutional-LSTM network,” in 2020 IEEE 7th International Conference on Data Science and Advanced
Analytics (DSAA), Oct. 2020, pp. 390–399, doi: 10.1109/DSAA49011.2020.00053.
[27] T.-T. Wong and P.-Y. Yeh, “Reliable accuracy estimates from k-fold cross validation,” IEEE Transactions on Knowledge and Data
Engineering, vol. 32, no. 8, pp. 1586–1594, Aug. 2019, doi: 10.1109/TKDE.2019.2912815.
[28] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text classification,” arXiv preprint arXiv:1607.01759,
2016, doi: 10.48550/arXiv.1607.01759.
[29] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, “Learning word vectors for 157 languages,” arXiv preprint
arXiv:1802.06893, 2018, doi: 10.48550/arXiv.1802.06893.
BIOGRAPHIES OF AUTHORS
Md. Nesarul Hoque is an assistant professor at the Bangabandhu Sheikh Mujibur Rahman
Science and Technology University in Bangladesh. He works in the Department of Computer Science
and Engineering. He received Bachelor of Science and Master of Science in Engineering degrees from
the University of Chittagong, Bangladesh. At present, he is pursuing a Doctor of Engineering degree
from the University of Chittagong, Bangladesh. He has published articles in renowned international conferences. His research interests cover the areas of machine learning, deep learning, natural language
processing, and information retrieval. He can be contacted at email: mnhshisir@gmail.com.
Md. Hanif Seddiqui has completed his Master of Engineering and Doctor of Engineering
degrees from Toyohashi University of Technology, Japan with a number of research awards. He has
also completed his post-doctoral fellowship under the European Erasmus Mundus cLINK Programme
in the DISP Laboratory, University Lumiere Lyon 2, France, and a short fellowship from KDE Lab,
Toyohashi University of Technology, Japan. Currently, he is taking responsibility as a professor in the
Department of Computer Science and Engineering, University of Chittagong, Bangladesh. He has also made remarkable contributions to instance matching, semantic NLP, and the healthcare and geospatial domains using semantic technology. Dr. Seddiqui is a regular member of the conference
program committees in the areas of machine learning, data mining, and semantic web as well as a
reviewer of some journals. He can be contacted at email: hanif@cu.ac.bd.