
Synthesis on ‘Sentiment analysis of modern standard Arabic and Egyptian dialectal Arabic tweets’

This paper presents a hybrid approach to sentiment analysis of Modern Standard
Arabic (MSA) and Egyptian dialectal Arabic using verbal and non-verbal cues in the
form of text and emojis.

Among the obstacles that the paper addresses are: the rich nature of the Arabic
language; the shape of each letter differs according to its position in a word;
Arabic treats everything (singular and plural) as either first person, masculine or
feminine, distinguished by prefixes and suffixes, whereas in English only pronouns
differ; Arabic resources are scarce in comparison to those for English; posts on
social media commonly contain unstructured text, poor grammar, spelling mistakes
and repeated letters; many adjectives are also used as given names, such as
جميلة; and sentiment-bearing idioms and phrases may not contain any sentiment words.

The author proposes a hybrid sentiment and emotion analysis for MSA and Egyptian
dialectal Arabic that combines lexicon-based and machine learning techniques. The
distinguishing properties and contributions of the proposed system are: besides the
polarity of the tweets, the emotions they reflect are also evaluated using Plutchik’s
Wheel of Emotions; both verbal and non-verbal cues are considered in the analysis of
polarity and emotion; an emoji lexicon of 120 emojis is created with the aid of
Plutchik’s Wheel of Emotions to evaluate the non-verbal cues; and dialectal Arabic,
which differs from MSA, is evaluated using a colloquial Arabic lexicon.

The preprocessing step, which is important because of the unstructured text used on
social media, involves data cleaning and normalization. Data cleaning removes
non-Arabic characters, punctuation, and redundant spaces and lines. Data
normalization ensures the uniformity of tweets by removing diacritics; normalizing
Alef, Yaa, Hamza and Taa Marbutah; removing repeated letters and repeated
intensifiers and negators; replacing URLs, hashtags, retweets and mentions; reducing
multi-word sentiment terms; and normalizing negations.
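
The character-level part of the normalization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `normalize_arabic`, the exact target letter for each mapping (for example, Hamza carriers mapped to a bare Hamza) and the rule "collapse runs of three or more identical letters to one" are assumptions based on common practice.

```python
import re

# Arabic diacritics (fathatan .. sukun) plus the superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumed letter mappings; the paper may use different targets.
CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yaa
    "\u0624": "\u0621",  # waw with hamza        -> hamza
    "\u0626": "\u0621",  # yaa with hamza        -> hamza
    "\u0629": "\u0647",  # taa marbutah          -> haa
})

# Runs of three or more identical Arabic letters (assumed threshold).
REPEATS = re.compile(r"([\u0621-\u064A])\1{2,}")


def normalize_arabic(text):
    """Apply diacritic removal, letter normalization and repeat collapsing."""
    text = DIACRITICS.sub("", text)    # remove diacritics
    text = text.translate(CHAR_MAP)    # normalize Alef, Yaa, Hamza, Taa Marbutah
    text = REPEATS.sub(r"\1", text)    # collapse elongated letters
    return " ".join(text.split())      # drop redundant spaces and line breaks


print(normalize_arabic("جمييييل"))  # elongated form of the word for "beautiful"
```

Replacement of URLs, hashtags, retweets and mentions, and the handling of intensifiers, negators and multi-word terms, are left out of this sketch because the summary does not specify how they are encoded.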
In order to train the machine learning classifier and predict the polarity of unknown
data, the extracted features are:

Feature: Explanation

N-grams: Unigrams and bigrams that occur at least 5 times are extracted.

Sentiment score:
- If a word is preceded by a negator, the polarity of the word is reversed.
- If an intensifier occurs either before or after the word, it doubles the score of the word.
- The position of a polar word is taken into account with the equation $\mathrm{score}_{term} = \mathrm{score}_{term} + \mathrm{score}_{term} \cdot \frac{l - j}{l}$, where $l$ is the length of the tweet and $j$ is the position of the polar word.
- If emojis are present, their scores are added to the total score.
- Each tweet is mapped to a scale by dividing the total score by the number of sentiment tokens.

Emotion label: The same process of polarity calculation is carried out for each of the 4 emotion axes. Each axis represents the score for its corresponding pair of opposite emotions: Joy-Sadness, Fear-Anger, Trust-Disgust and Surprise-Anticipation. The assigned label is that of the emotion with the highest score.

Emotion axes scores: Scores for each of the 4 emotion axes.

Number of sentiment tokens: Count of sentiment words and emojis in the tweet.

Number of emotion tokens: Count of emotion words and emojis.

Has negations: A Boolean value indicating the presence of negations.

Negation count: The number of negations present in the tweet.

HasPositiveSentiment, HasNegativeSentiment: Boolean values indicating the presence of positive and negative tokens, respectively.

Token count: The number of tokens in the tweet.
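
The sentiment score rules listed above can be sketched as a small function. This is an illustrative reading of the summary, not the paper's code: the function name `tweet_score`, the toy lexicons, the 1-based position `j`, the order in which the rules are applied and the reading of the position weight as (l - j)/l are all assumptions.

```python
def tweet_score(tokens, lexicon, emoji_lexicon, negators, intensifiers):
    """Score one tokenized tweet with the lexicon rules summarized above."""
    l = len(tokens)                 # length of the tweet in tokens
    total = 0.0
    n_sentiment = 0                 # sentiment words and emojis found

    for j, tok in enumerate(tokens, start=1):   # j is assumed to be 1-based
        if tok in emoji_lexicon:
            total += emoji_lexicon[tok]          # emoji scores are added as-is
            n_sentiment += 1
            continue
        if tok not in lexicon:
            continue

        score = lexicon[tok]
        prev_tok = tokens[j - 2] if j >= 2 else None
        next_tok = tokens[j] if j < l else None

        if prev_tok in negators:                 # a preceding negator flips polarity
            score = -score
        if prev_tok in intensifiers or next_tok in intensifiers:
            score *= 2                           # an adjacent intensifier doubles it
        score = score + score * (l - j) / l      # position weighting
        total += score
        n_sentiment += 1

    # Map the tweet to a scale: total score over number of sentiment tokens.
    return total / n_sentiment if n_sentiment else 0.0


# Hypothetical toy resources, for illustration only.
lexicon = {"جميل": 1.0}        # "beautiful"
emoji_lexicon = {"😊": 1.0}
negators = {"مش"}              # Egyptian "not"
intensifiers = {"جدا"}         # "very"

print(tweet_score(["مش", "جميل"], lexicon, emoji_lexicon, negators, intensifiers))
print(tweet_score(["جميل", "جدا", "😊"], lexicon, emoji_lexicon, negators, intensifiers))
```

Under these assumptions, polar words near the start of a tweet receive a larger weight than words near the end, since (l - j)/l shrinks as j grows.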

The final step of the algorithm is binary classification of the generated feature
vectors, which is evaluated in two experiments. Before each classification, the
order of the data is randomized and, to ensure that there is no bias towards any
class, the SMOTE filter is applied. All experiments use 10-fold cross-validation.
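
The evaluation protocol described above (class balancing, randomized order, 10-fold cross-validation) can be sketched in plain Python. This is a simplified stand-in, not the SMOTE filter used in the paper: `smote_like`, `balance` and `k_fold_indices` are hypothetical helpers, the oversampling only interpolates between a minority sample and one of its nearest minority neighbours, and the order of balancing and shuffling is an assumption.

```python
import random


def smote_like(rows, n_new, k=5, rng=None):
    """Create n_new synthetic rows by interpolating between a row and one of
    its k nearest neighbours from the same class (simplified SMOTE idea)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(rows)
        neighbours = sorted(
            (p for p in rows if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def balance(X, y, rng):
    """Oversample every class up to the size of the largest class."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        new_rows = smote_like(rows, target - len(rows), rng=rng)
        X_out.extend(new_rows)
        y_out.extend([label] * len(new_rows))
    return X_out, y_out


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; each index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [idx for f in (folds[:i] + folds[i + 1:]) for idx in f]
        yield train, folds[i]


# Toy feature vectors: 12 samples of class 0 and 4 of class 1 (made up).
rng = random.Random(42)
X = [[float(i), float(i % 3)] for i in range(16)]
y = [0] * 12 + [1] * 4

X_bal, y_bal = balance(X, y, rng)
data = list(zip(X_bal, y_bal))
rng.shuffle(data)                     # randomize the order of the data

for train_idx, test_idx in k_fold_indices(len(data), k=10):
    pass  # train a classifier on train_idx and evaluate it on test_idx
```

Any classifier can be plugged into the loop; the sketch only shows how the data would be prepared and split.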
In the first experiment, the dataset is the one collected by Mourad et al.
(‘Subjectivity and sentiment analysis of modern standard Arabic and Arabic
microblogs’). The classification algorithms tested are: a Bagging ensemble
classifier with a linear-kernel SVM (C = 16), a Random Forest classifier, and a
Bagging ensemble classifier with Random Forest.

In the second experiment, the dataset is the one constructed by Shoukry et al.
(‘Sentence-level Arabic sentiment analysis’). The algorithms used are SVM and
Naïve Bayes.

The results of the experiments:

We can conclude that the proposed algorithm outperforms the other works it is
compared with.

Some references:

1) Context based Arabic stemmer by El-Defrawy
2) Sentiment analysis of colloquial Arabic tweets by El-Makky
3) Mining social networks Arabic slang by R. Hedar
4) A rule based approach to polarity detection by E. Tromp
5) Arabic part of speech tagger by S. Khoja
6) A fully automated approach for Arabic slang lexicon extraction from microblogs by H. ElSahar
7) Building a wordnet for Arabic by S. Elkateb
