
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

"JNANASANGAMA", MACHHE, BELAGAVI-590018

A Technical Seminar Report


On
Detecting Arabic Textual Threats In Social Media Using Artificial
Intelligence
Submitted in partial fulfillment of the requirements for the VIII semester
Bachelor of Engineering
in
Information Science and Engineering
of
Visvesvaraya Technological University, Belagavi.
By
Kowsalya R
(1CD20IS053)

Under the Guidance of

Prof. Santosh M
Assistant Professor
Department of ISE

Department of Information Science and Engineering


CAMBRIDGE INSTITUTE OF TECHNOLOGY, BANGALORE-560036
2023-2024
CAMBRIDGE INSTITUTE OF TECHNOLOGY
K.R. Puram, Bangalore-560036
DEPARTMENT OF INFORMATION SCIENCE & ENGINEERING

CERTIFICATE

Certified that Ms. Kowsalya R, bearing USN 1CD20IS053, a bonafide student of Cambridge
Institute of Technology, has successfully completed the technical seminar entitled “Detecting
Arabic Textual Threats In Social Media Using Artificial Intelligence” in partial fulfillment of
the requirements for the VIII semester Bachelor of Engineering in Information Science and
Engineering of Visvesvaraya Technological University, Belagavi, during the academic year
2023-2024. It is certified that all corrections/suggestions indicated for Internal Assessment have
been incorporated in the report deposited in the departmental library. The seminar report has been
approved as it satisfies the academic requirements in respect of the technical seminar prescribed
for the Bachelor of Engineering degree.

-----------------------------    -----------------------------    --------------------------
Seminar Guide                    Seminar Co-Ordinator             Head of the Dept.
Prof. Santosh M                  Prof. Anusha B                   Dr. Preethi S
Assistant Professor              Assistant Professor              Associate Professor
Dept. of ISE, CiTech             Dept. of ISE, CiTech             Dept. of ISE, CiTech

ACKNOWLEDGEMENT

I would like to place on record my deep sense of gratitude to Shri. D. K. Mohan, Chairman,
Cambridge Group of Institutions, Bangalore, India, for providing the excellent infrastructure and
academic environment at CITech without which this work would not have been possible.

I am extremely thankful to Dr. G. Indumathi, Principal, CITech, Bangalore, for providing me
the academic ambience and everlasting motivation to carry out this work and for shaping our careers.

I express my sincere gratitude to Dr. Preethi S, HOD, Dept. of Information Science and
Engineering, CITech, Bangalore, for her stimulating guidance, continuous encouragement and
motivation throughout the course of the present work.

I also wish to extend my thanks to Ms. Anusha B, Assistant Professor, Seminar Coordinator,
Dept. of ISE, CITech, Bangalore, for her critical, insightful comments, guidance and
constructive suggestions to improve the quality of this work.

I also wish to extend my thanks to Mr. Santosh M, Assistant Professor, Dept. of ISE, CITech, for
his guidance and impressive technical suggestions to complete my seminar.

Finally, I thank all my friends and classmates, who always stood by me in difficult situations and
also helped me in some technical aspects. Last but not least, I wish to express my deepest sense of
gratitude to my parents, who were a constant source of encouragement and stood by me as a pillar
of strength for completing this work successfully.

Kowsalya R
ABSTRACT

Recent studies show that social media has become an integral part of everyone's daily routine.
People often use it to convey their ideas, opinions, and critiques. Consequently, the
increasing use of social media has motivated malicious users to misuse online anonymity.
These users can exploit this advantage to engage in socially unacceptable behavior. The use
of inappropriate language on social media is one of the greatest societal dangers that exist
today. Therefore, there is a need to monitor and evaluate social media postings using
automated methods and techniques. The majority of studies that deal with offensive language
classification in texts have used English datasets, whereas offensive language detection in
Arabic has received less attention. The Arabic language has different rules and structures.
This article provides a thorough review of research studies that have made use of artificial
intelligence (AI) for the identification of Arabic offensive language in various contexts.
CONTENTS

Abstract
Contents
List of Figures

Chapters

Chapter 1 Introduction
Chapter 2 Literature Survey
Chapter 3 Experimental Set-up
Chapter 4 Methodology
Chapter 5 Real World Applications
    5.1 Enhancing Public Safety
    5.2 Protecting Individuals and Communities
    5.3 Improving Social Media Experience
Conclusion
References

List of Figures

Figure No.    Figure Name
4.1           Applications of Artificial Intelligence on Social Media

Dept.of ISE,CITech 2023-24 Page1


CHAPTER 1

INTRODUCTION

Individuals have become increasingly connected with one another as social networks have grown over the
last few decades. Microblogging technologies have given people all over the globe the chance to
communicate on a massive scale and in real time. People can now communicate freely, sharing a wide range
of ideas, feelings, and information. However, users of these platforms may choose to remain anonymous,
which raises the risk of misuse. As a result, offensive language of various kinds, such as hate speech and
cyberbullying, has become more widespread on social media.

In certain countries, legislation prohibits hate speech on social networking platforms. In Germany, for
example, the Network Enforcement Act was passed in 2017. Legislative amendments continue to target
offensive language, and advanced technical approaches can help social media platforms and others
implement these laws. Online offensive language detection has been studied in multiple languages, such as
English, German, Turkish, Hindi, Chinese, and Arabic. Working with Arabic can be difficult because of its
morphological complexity and lexical ambiguity. Another issue is that Arabic includes a wide range of
dialects. In this article, we focus on the artificial intelligence approaches applied, the performance
measures used, and the dataset details (source, dialect, annotation methodology) for offensive language
detection in Arabic. This should guide future studies by giving researchers a more uniform and compatible
view of the issue.

The rest of the article is structured as follows. Section 2 presents offensive language types, Arabic
language issues, data preparation steps, feature representation techniques, AI approaches, and related
work. Section 3 looks into Arabic datasets that have been used in previous studies. Section 4 discusses
significant works and ongoing research in the area of Arabic social threat detection, and section 5
provides an evaluation of the results and a discussion. Finally, section 6 presents the conclusion.

1.1 Problem Definition

AI-powered Arabic Threat Detection on Social Media focuses on utilizing artificial intelligence to analyze
social media content in Arabic and identify potential threats. This can involve:

Natural Language Processing (NLP): AI can be trained to understand the nuances of the Arabic language,
including slang, dialects, and sarcasm, to identify keywords and phrases that might indicate threats.

Machine Learning: AI algorithms can learn from vast amounts of social media data to recognize patterns
associated with threats, such as specific mentions of violence or calls to action.

Sentiment Analysis: AI can analyze the overall sentiment of a post to gauge the potential for hostility or
negativity.

This technology can be a valuable tool for identifying potential threats early on, but it's important to
remember that AI is not perfect. It's crucial to have human oversight to ensure accuracy and avoid flagging
harmless content.

CHAPTER 2

LITERATURE SURVEY

2.1 H Elzayady, MS Mohamed, KM Badran, GI Salama

Indonesian Journal of Electrical Engineering and Computer Science, 2022

People often use social media to convey their ideas, opinions, and critiques. Consequently, the increasing
use of social media has motivated malicious users to misuse online anonymity. These users can exploit
this advantage to engage in socially unacceptable behavior. The use of inappropriate language on social
media is one of the greatest societal dangers that exist today. Therefore, there is a need to monitor and
evaluate social media postings using automated methods and techniques. The majority of studies that deal
with offensive language classification in texts have used English datasets, whereas offensive language
detection in Arabic has received less attention. The Arabic language has different rules and structures.
This article provides a thorough review of research studies that have made use of artificial
intelligence (AI) for the identification of Arabic offensive language in various contexts.

2.2 Haddad et al. [51] presented the Tunisian hate and abusive speech (T-HSAB) dataset.

T-HSAB addressed many political, social, religious, women's rights, and immigration issues. The authors
did not specify which online sources they used, although they did make it clear that the data was
collected from social media sites between October 2018 and March 2019. The dataset contained 6,039
comments: 3,834 normal, 1,127 abusive, and 1,078 hate.

Mubarak et al. [52] introduced the open-source Arabic corpora and corpora processing tools (OSACT)
dataset. Tweets were chosen based on two factors: they contained the Arabic vocative particle "يا"
("yā", meaning "O") and were posted between April 15 and May 6, 2019. This publicly available OSACT 2020
Arabic dataset consisted of 10,000 thoroughly annotated tweets. A two-level label hierarchy was used for
each tweet. The top level of labeling was binary: either offensive or not offensive. Only 1,900 tweets
out of 10,000 were offensive, and only 95 of those were hate speech.
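
The class counts quoted above can be turned into class shares to make the imbalance of each dataset explicit. The following is a minimal sketch, not part of the surveyed work: the function names are illustrative, and the OSACT "not offensive" count of 8,100 is derived here as 10,000 minus 1,900 rather than quoted from the source. Note that the three T-HSAB class counts add up to 6,039.

```python
# Class counts as quoted in the text; labels follow the datasets' own wording.
DATASETS = {
    "T-HSAB": {"normal": 3834, "abusive": 1127, "hate": 1078},
    # 8100 is derived (10,000 total - 1,900 offensive), not quoted from the source.
    "OSACT 2020": {"not offensive": 8100, "offensive": 1900},
}


def class_shares(counts):
    """Return each class's share of the dataset as a fraction of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest class."""
    return max(counts.values()) / min(counts.values())


for name, counts in DATASETS.items():
    shares = class_shares(counts)
    summary = ", ".join(f"{label}: {share:.1%}" for label, share in shares.items())
    print(f"{name} (n={sum(counts.values())}): {summary}; "
          f"largest/smallest class ratio {imbalance_ratio(counts):.2f}")
```

A skew like this is one reason accuracy alone can be misleading for offensive language detection: a classifier that labels every OSACT tweet "not offensive" would already be right 81% of the time.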

2.3 Dr. Tarek Abd El-Hafeez, Tarek Mahmoud

Automatic Detection of Cyberbullying and Abusive Language in Arabic Content on Social
Networks: A Survey

Online social networks are emerging as a key player in today’s world, providing a platform for expression
and content distribution. This technology enables users to communicate easily with each other and share
their data instantly. However, the internet is not always safe; it can be a source of abusive and
harmful content that causes harm to others. Because of the negative effects of abusive language and
cyberbullying, there is a great need for approaches and strategies to address these issues. Arabic text
is known for its challenges, its complexity, and the scarcity of its resources. Many efforts have been
made to find automated solutions for detecting abusive language and cyberbullying in other languages,
but not many for Arabic.

2.4 Mubarak et al. [48] provided a Modern Standard Arabic (MSA) dataset for the purpose of identifying
racist, sexist, abusive attacks, instigating, and irrelevant comments from Aljazeera.net users.

Only the shorter comments (3 to 200 characters) were retained, reducing the final dataset to 32,000
comments. Professional annotators categorized the dataset into three categories: obscene, offensive, and
clean. This research also offered another dataset of 1,100 Egyptian tweets. Although it contained only a
limited number of tweets, it was the first use of a dialectal Arabic dataset for offensive language.

Albadi et al. [49] used Twitter data to build the first Arabic religious hate speech corpus. The data was
extracted using Arabic keywords and covers the six most significant beliefs: Christian, Islamic, Sunni,
Shia, Jewish, and Atheist. The training dataset consisted of 6,000 tweets, 1,000 for each religion or
belief, whereas the testing dataset comprised 600 tweets, 100 for each religion or belief. Every tweet is
associated with two levels of labels. First, annotators assigned each tweet to one of three categories:
hate, neutral, or non-relevant (later excluded). Next, tweets in the hate class were given one of seven
labels: Shia, Sunnis, Muslims, Jews, Christians, atheists, and others.

The Levantine Twitter dataset for hate speech and abusive language (L-HSAB) was made publicly available
in [50]. It is thought to be the first Arabic hate speech dataset focusing on the Levantine area.
Political issues were the main theme of the dataset. L-HSAB included 5,846 tweets, of which 3,650 were
labeled "normal," 1,728 "abusive," and 468 "hate". Three native Levantine speakers provided the
annotations for the dataset.

CHAPTER 3

EXPERIMENTAL SET-UP

This experiment aims to develop an AI system using deep learning techniques to identify potential
threats in Arabic social media content. Here's a breakdown of the setup:

Data Collection:

Arabic Social Media Dataset: Gather a large corpus of Arabic text data from social media platforms
(Twitter, Facebook etc.) containing both threat-related and non-threatening content.

Ensure the data is balanced, with enough examples of each category.

Consider partnering with social media platforms or obtaining publicly available datasets.

Threat Labeling: Annotate the collected data. Label each post/message as either "Threat" or
"Non-Threat".

Leverage human annotators with expertise in Arabic language and threat identification.

Model Development:

Preprocessing: Clean the Arabic text data by removing noise (URLs, hashtags, punctuation) and
normalizing dialectal variations.

Apply Arabic word segmentation (separating clitics such as attached prefixes and suffixes from
the stem) and tokenization (breaking text into word or subword tokens).

Deep Learning Architecture: Choose a deep learning architecture suitable for text classification, such
as Long Short-Term Memory (LSTM) networks or Convolutional Neural Networks (CNNs) with an
embedding layer.

The embedding layer learns vector representations for Arabic words, capturing semantic
relationships.

Train the model on the labeled data, allowing it to learn patterns that differentiate threatening and
non-threatening content.
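
To make the role of the embedding layer and the convolution concrete, here is a minimal pure-Python sketch of the forward pass of a tiny text CNN: embedding lookup, 1-D convolution over token windows, ReLU, global max pooling, and a sigmoid output. It is an illustration only, not the system described in this report. The class name, layer sizes, and sample tokens are hypothetical, and the weights are random and untrained, so the output probability carries no meaning until the model has been trained (which a real setup would do with a deep learning framework).

```python
import math
import random

random.seed(0)  # fixed seed so the untrained weights are reproducible

EMBED_DIM = 4    # size of each word vector (hypothetical, tiny for readability)
KERNEL_SIZE = 2  # each convolution filter looks at 2 consecutive tokens
NUM_FILTERS = 3  # number of convolution filters


def build_vocab(token_lists):
    """Map each token to an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def rand_matrix(rows, cols):
    """Random weight matrix standing in for learned parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyTextCNN:
    """Embedding -> 1-D convolution -> ReLU -> global max pooling -> sigmoid."""

    def __init__(self, vocab_size):
        self.embedding = rand_matrix(vocab_size, EMBED_DIM)
        # One KERNEL_SIZE x EMBED_DIM weight matrix per filter.
        self.filters = [rand_matrix(KERNEL_SIZE, EMBED_DIM) for _ in range(NUM_FILTERS)]
        self.out_weights = [random.uniform(-0.5, 0.5) for _ in range(NUM_FILTERS)]
        self.out_bias = 0.0

    def forward(self, token_ids):
        vectors = [self.embedding[i] for i in token_ids]  # embedding lookup
        while len(vectors) < KERNEL_SIZE:                 # zero-pad very short inputs
            vectors.append([0.0] * EMBED_DIM)
        pooled = []
        for filt in self.filters:
            activations = []
            for start in range(len(vectors) - KERNEL_SIZE + 1):
                window = vectors[start:start + KERNEL_SIZE]
                total = sum(w * x
                            for w_row, x_row in zip(filt, window)
                            for w, x in zip(w_row, x_row))
                activations.append(max(0.0, total))       # ReLU
            pooled.append(max(activations))               # global max pooling
        logit = sum(w * p for w, p in zip(self.out_weights, pooled)) + self.out_bias
        return 1.0 / (1.0 + math.exp(-logit))             # probability of "Threat"


# Hypothetical, already tokenized posts.
posts = [["هذا", "مثال", "عادي"], ["مثال", "اخر"]]
vocab = build_vocab(posts)
model = TinyTextCNN(len(vocab))
ids = [vocab.get(tok, 0) for tok in posts[0]]
prob = model.forward(ids)
print(f"untrained threat probability: {prob:.3f}")
```

Max pooling is what lets the classifier react to a short threatening phrase wherever it appears in the post, since only the strongest window response per filter is kept.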

Evaluation:

Once trained, evaluate the model's performance on a separate test dataset not used for training.

Use metrics like accuracy, precision, and recall to assess the model's ability to identify threats
effectively.
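
The metrics named above can be computed directly from the true and predicted labels. The sketch below is illustrative: the function names and the eight sample labels are made up, and the F1 score is an addition that the text does not mention (it is the harmonic mean of precision and recall, which is often reported when classes are imbalanced).

```python
def confusion_counts(y_true, y_pred, positive="Threat"):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive="Threat"):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up test labels: T = "Threat", N = "Non-Threat".
T, N = "Threat", "Non-Threat"
y_true = [T, T, N, N, N, T, N, N]
y_pred = [T, N, N, T, N, T, N, N]

metrics = evaluate(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the model catches two of the three real threats (recall 2/3) and two of its three alarms are correct (precision 2/3), while accuracy is 0.75 because the four correctly ignored harmless posts also count.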

Deployment:

Real-time monitoring: Integrate the trained model into a system that monitors live Arabic social
media feeds.

The model analyzes incoming posts and flags those with a high probability of containing threats.

Human Review: The flagged posts are then directed for human review by analysts who can assess
the context and intent behind the content and make a final determination.
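
The flag-then-review flow described above can be sketched as a small triage loop. Everything here is a hypothetical illustration: `ReviewQueue`, `monitor`, the 0.8 threshold, and the stand-in scoring function are assumptions, and a real deployment would call the trained model and a platform's streaming interface instead.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReviewQueue:
    """Holds flagged posts, with their scores, until an analyst reviews them."""
    items: List[Tuple[str, float]] = field(default_factory=list)

    def add(self, post: str, score: float) -> None:
        self.items.append((post, score))


def monitor(posts, score_fn, queue, threshold=0.8):
    """Score each incoming post and queue those at or above the threshold."""
    for post in posts:
        score = score_fn(post)
        if score >= threshold:
            queue.add(post, score)
    return queue


# Stand-in for a trained model: fixed, made-up probabilities per post.
FAKE_SCORES = {"post A": 0.93, "post B": 0.12, "post C": 0.81}


def fake_score(post: str) -> float:
    return FAKE_SCORES.get(post, 0.0)


queue = monitor(FAKE_SCORES.keys(), fake_score, ReviewQueue())
for post, score in queue.items:
    print(f"needs human review: {post} (score {score:.2f})")
```

The model only decides what reaches the queue; the final determination stays with the analyst, which matches the division of labor described above.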

Additional Considerations:

Bias Mitigation: Train the model on diverse datasets to minimize bias towards specific dialects or
threat types.

Regularly monitor the model's performance for bias and retrain as necessary.

Explainability: Consider using techniques like Layer-wise Relevance Propagation (LRP) to
understand why the model classifies certain content as a threat.

This can improve trust and transparency in the system.

Legal and Ethical Implications: Ensure compliance with data privacy regulations and ethical
guidelines for AI use in social media monitoring.

CHAPTER 4

METHODOLOGY

This chapter takes a closer look at a methodology for identifying Arabic threats on social media using AI.

Data Acquisition and Preprocessing:

Data Collection: Partner with social media platforms or utilize public data sets containing Arabic text
to train your AI model.

Arabic Language Processing (ALP): Apply NLP techniques specific to Arabic, like Named Entity
Recognition (NER) for identifying locations and people, and Part-of-Speech (POS) tagging to
understand sentence structure.

Threat-Specific Lexicon Development: Create a lexicon of Arabic words and phrases commonly
associated with threats, violence, hate speech, and extremism. This lexicon can be built from
open-source threat intelligence reports and manually curated by Arabic language experts.

Data Cleaning and Augmentation: Clean the data by removing irrelevant information like URLs and
hashtags. Consider data augmentation techniques like back-translation to artificially expand your
training dataset.
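
The cleaning and lexicon steps above can be sketched with the standard `re` module. This is a minimal illustration under stated assumptions: the normalization choices (stripping diacritics and tatweel, unifying alef variants, mapping alef maqsura to yeh) are common conventions rather than steps specified in this report, and the three lexicon entries are hypothetical placeholders, not a curated threat lexicon.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\S+")
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")   # tashkeel marks
TATWEEL = "\u0640"                                     # elongation character
ALEF_VARIANTS_RE = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")      # keep Arabic letters, spaces


def clean(text):
    """Remove noise and normalize common orthographic variants."""
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS_RE.sub("\u0627", text)  # unify to bare alef
    text = text.replace("\u0649", "\u064A")      # alef maqsura -> yeh
    text = NON_ARABIC_RE.sub(" ", text)          # drops punctuation, digits, Latin
    return " ".join(text.split())


def tokenize(text):
    """Whitespace tokenization of the cleaned text."""
    return clean(text).split()


# Hypothetical placeholder entries ("threat", "killing", "revenge").
THREAT_LEXICON = {"تهديد", "قتل", "انتقام"}


def lexicon_hits(text, lexicon=THREAT_LEXICON):
    """Return the tokens of a post that appear in the lexicon."""
    return [tok for tok in tokenize(text) if tok in lexicon]


sample = "سوف يكون هناك انتقام http://example.com #وسم"
print(tokenize(sample))
print(lexicon_hits(sample))
```

Exact token matching is deliberately naive. Arabic attaches clitics such as the conjunction or the definite article to the word, so a real lexicon lookup would typically run on segmented or stemmed tokens.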

AI Model Development and Training:

Deep Learning Models: Utilize deep learning architectures like Long Short-Term Memory (LSTM)
networks or Transformers, which excel at capturing context in sequential data like text. These
models can be trained on labeled data where Arabic text is categorized as threatening or
non-threatening.

Transfer Learning: Leverage pre-trained Arabic language models (ALMs) like AraBERT and
fine-tune them on your specific threat detection task. This can significantly reduce training
time and improve accuracy.

Multi-modal Analysis (Optional): Explore incorporating additional data points like user profiles,
location information, and image recognition to enrich the model's understanding of potential threats.

Threat Detection and Analysis:

Real-time Monitoring: Deploy the trained AI model to analyze live streams of Arabic text on social
media platforms. The model assigns a threat score to each post based on the likelihood of containing
threatening content.

Alerting and Human Review: Set thresholds for the threat score. Posts exceeding the threshold
trigger alerts for human analysts to review and determine the appropriate course of action. This
ensures a balance between automation and human judgment.

Continuous Improvement: Continuously monitor the model's performance and retrain it with new
data to improve accuracy over time. Analyze false positives and negatives to refine the threat
lexicon and adjust model parameters.
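
Choosing the alert threshold is a trade-off between false positives (harmless posts sent to analysts) and false negatives (threats that are missed). The sketch below counts both at several candidate thresholds. The scores, labels, and threshold values are made up for illustration, and a post is treated as flagged when its score is at or above the threshold.

```python
def sweep_thresholds(scores, labels, thresholds):
    """Count false positives and false negatives at each threshold.

    scores: model threat scores in [0, 1]
    labels: True if analysts confirmed the post as a real threat
    """
    rows = []
    for th in thresholds:
        fp = sum(1 for s, y in zip(scores, labels) if s >= th and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < th and y)
        rows.append({"threshold": th, "false_positives": fp, "false_negatives": fn})
    return rows


# Made-up scores with analyst-confirmed labels.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

rows = sweep_thresholds(scores, labels, [0.3, 0.5, 0.7, 0.9])
for row in rows:
    print(row)
```

Raising the threshold reduces the analysts' workload at the cost of missed threats, so the operating point is a policy decision as much as a technical one.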

Additional Considerations:

Dialect Variation: Account for the various dialects of Arabic by training the model on a diverse
dataset or developing dialect-specific models.

Cultural Context: Be mindful of cultural nuances that might influence the interpretation of
potentially threatening language. Human oversight is crucial to avoid misinterpretations.

Transparency and Explainability: Develop mechanisms to explain the AI model's reasoning behind
threat classifications. This fosters trust and allows for adjustments if biases are identified.

CHAPTER 5

REAL WORLD APPLICATIONS

5.1 Enhancing Public Safety

Threat Identification:

Social media platforms can leverage AI to automatically flag posts containing Arabic text that
suggests violence, terrorism, or other harmful activities. This allows for quicker intervention by
moderators or law enforcement.

Crisis Management:

During emergencies or developing situations, AI can be used to monitor Arabic social media
conversations for threats or calls to violence. This can aid authorities in taking necessary
precautions and mitigating potential harm.

5.2 Protecting Individuals and Communities

Combating Hate Speech:

AI can identify hateful or discriminatory content written in Arabic. This can help social media
platforms create safer spaces for online discourse and protect vulnerable communities.

Countering Radicalization:

By analyzing online conversations, AI can potentially detect individuals who may be susceptible to
radicalization. This information can be used to provide support or intervention programs.

5.3 Improving Social Media Experience

Content Moderation: AI can streamline the process of moderating social media content by
automatically filtering out Arabic text that violates platform guidelines. This frees up human moderators
to focus on more complex issues.

Personalized User Experience: Social media platforms can use AI to personalize user experience by
filtering out potentially harmful or offensive content written in Arabic, creating a more positive online
environment.

It's important to remember that AI for threat detection is still under development. Challenges include the
complexity of the Arabic language, with its dialects and slang, and the need for human oversight to avoid
flagging legitimate content. However, this technology holds significant promise for making social media
a safer and more positive space for everyone.

CONCLUSION
Many researchers have recently become interested in detecting Arabic offensive language on social
networks using artificial intelligence. This article discusses the methods proposed to address the problem of
Arabic offensive language, which takes various forms such as hate speech and cyberbullying. It covers the
techniques used, performance metrics, and dataset characteristics (dialect, annotation method, and platform).
The findings show that research on this topic is still in its early stages, and most techniques have not yet
yielded a practical classification system for Arabic text. Moreover, only a small number of Arabic datasets
are available for offensive language classification. Consequently, this work remains challenging because of
the limited number of datasets, complex pre-processing procedures, and a lack of publications in this area.



