Synthesis on ‘Sentiment analysis for modern standard Arabic and colloquial’

Sentiment analysis refers to the use of natural language processing, text analysis
and computational linguistics to identify and extract subjective information from
source materials.

There are two main approaches to sentiment classification: the machine learning
approach and the semantic orientation approach. The paper uses a semi-supervised
approach that combines a high-coverage Arabic sentiment word lexicon, which is
expanded automatically, a lexicon of sentiment idioms and phrases to improve the
classification process, and an SVM classifier with linguistic and syntactic
features.

According to the paper, the difficulty is that the natural language processing
approaches applied to most other languages cannot be applied directly to Arabic.
This makes it necessary, first of all, to construct a sentiment lexicon, to handle
negation, and to apply preprocessing specific to the dialect being handled.

The authors built their own corpus of 2000 Arabic sentiment statements, which
includes 1000 MSA tweets, Arabic dialect tweets and 1000 microblogs collected from
the Twitter API and from various forum websites such as http://www.booking.com,
http://forum.fatakat.com, http://ejabat.google.com, etc. The data was annotated
manually by native speakers, and agreement between annotators was then measured
with the Kappa statistic to assess the quality of the annotations.
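
The summary does not say which Kappa variant the authors used. As a minimal sketch, assuming two annotators and Cohen's kappa, the agreement score could be computed as follows (the label lists are made-up placeholders, not data from the paper):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical annotations for six statements.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance.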

The main contribution of the paper is ‘ArSeLEX’, a lexicon of 5244 sentiment
adjectives with their polarity values. It can be expanded automatically by
collecting the synonyms and antonyms of each word from different Arabic
dictionaries and labeling each collected word. The authors also built a collection
of 12785 wisdoms and idioms (http://proz.com/glossary-translations/), from which
3296 phrases were selected and annotated. To make use of this resource, a heuristic
rule is applied to any topic that contains idioms: known idioms are replaced with
text masks, which prevents redundancy in the classification process.

The sentiment analysis system (SAS) has two parts: the first covers preprocessing
and lexicon expansion, and the second covers feature extraction and classification.
Preprocessing includes data cleaning, stop word removal and normalization.
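
The preprocessing steps and the idiom-masking heuristic can be illustrated with a small sketch. The summary does not list the concrete rules, so everything here is an assumption: the normalization rules (stripping diacritics and tatweel, unifying alef forms) are common choices for Arabic text, and the stop word set, the idiom entry, the `IDIOM_` mask format and all function names are hypothetical placeholders.

```python
import re

# Tiny illustrative placeholders, not the paper's resources.
STOP_WORDS = {"في", "من", "على"}
IDIOMS = {"على راسي": "positive"}  # hypothetical idiom entry and label

# Short vowels, tanween, shadda, sukun (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, hamza below.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
# URLs and user mentions, treated here as noise to be cleaned.
NOISE = re.compile(r"https?://\S+|@\w+")

def normalize(text):
    """Drop URLs and mentions, strip diacritics, unify alef forms."""
    text = NOISE.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)  # bare alef
    return " ".join(text.split())

def mask_idioms(text, idioms):
    """Replace each known idiom with one mask token carrying its label.

    Uses plain substring matching; a real system would likely match on
    token boundaries. Idiom keys must already be in normalized form.
    """
    for phrase in sorted(idioms, key=len, reverse=True):  # longest first
        text = text.replace(phrase, "IDIOM_" + idioms[phrase].upper())
    return text

def remove_stop_words(text, stop_words):
    return " ".join(tok for tok in text.split() if tok not in stop_words)

def preprocess(text, idioms=IDIOMS, stop_words=STOP_WORDS):
    # Masking runs before stop word removal so that idioms containing
    # stop words are still recognized as whole phrases.
    masked = mask_idioms(normalize(text), idioms)
    return remove_stop_words(masked, stop_words)
```

Because an idiom is collapsed into a single mask token, the polar words inside it are no longer looked up individually, which is one plausible reading of how masking prevents redundancy.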

Lexicon expansion and orientation detection:
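
The expansion step is described only as collecting synonyms and antonyms from dictionaries and labeling each word. One simple way to sketch it, assuming that a synonym inherits the polarity of its source word and an antonym receives the opposite polarity, is a breadth-first traversal. The function name, the dictionary format and the toy entries below are hypothetical, not taken from ArSeLEX.

```python
from collections import deque

def expand_lexicon(seed, synonyms, antonyms):
    """Grow a polarity lexicon from seed words.

    seed:     {word: polarity}, with polarity +1 or -1
    synonyms: {word: [synonym, ...]} from some dictionary
    antonyms: {word: [antonym, ...]} from some dictionary
    """
    lexicon = dict(seed)
    queue = deque(seed)
    while queue:
        word = queue.popleft()
        polarity = lexicon[word]
        for relation, sign in ((synonyms, 1), (antonyms, -1)):
            for new_word in relation.get(word, ()):
                if new_word not in lexicon:  # the first label found wins
                    lexicon[new_word] = sign * polarity
                    queue.append(new_word)
    return lexicon

# Toy dictionaries for illustration only.
seed = {"جميل": 1}
synonyms = {"جميل": ["رائع"], "قبيح": ["بشع"]}
antonyms = {"جميل": ["قبيح"]}
expanded = expand_lexicon(seed, synonyms, antonyms)
```

Newly added words are queued in turn, so the lexicon keeps growing until the dictionaries offer no unseen words. A production system would also need to resolve words that are reached with conflicting labels, which this sketch sidesteps by keeping the first label.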

For feature engineering, the SAS employs several kinds of features. Standard
features indicate positive or negative polarity. Sentence-level features, which
heavily affect classification accuracy, include: (1) term frequency, which
increases the polarity strength of a polar word; (2) polar word position, which
increases the feature by a value based on the position of the polar word in the
sentence, calculated as (number of sentence words / word position); and (3) number
of words, which improves classification accuracy on long statements. To handle
negation, two features were added: (4) Is_negation, which indicates the presence of
negation, and (5) N_O_Negation, which gives the number of negations. Contextual
intensifiers are used to emphasize the sentiment polarity of a word. Two features,
(6) Is_Question and (7) N_O_Question, capture questions, which are treated as
negative sentiment because the writer is sad, angry or wondering due to a lack of
information. The authors likewise treat supplication and wishful expressions as
negative sentiment, captured by (8) Is_wishful and (9) N_O_wishful. To avoid
conflicting phrases, POS tag n-gram patterns are used to detect inflection phrases,
such as a positive noun followed by a negative adjective, which a polarity lexicon
alone would classify as positive even though the phrase expresses a negative
sentiment.
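
A rough sketch of how some of these sentence-level features might be computed is given below. The names Is_negation, N_O_Negation, Is_Question, N_O_Question, Is_wishful and N_O_wishful come from the summary; the remaining feature names, the word lists passed in, and the choice to sum a signed (number of sentence words / word position) value over all polar words are assumptions made for illustration. Intensifiers and POS tag patterns are left out.

```python
QUESTION_MARKS = ("?", "\u061F")  # Latin and Arabic question marks

def extract_features(tokens, lexicon, negation_words, wish_words):
    """Build a feature dictionary for one tokenized statement.

    lexicon maps a word to +1 or -1; words not in it count as neutral.
    """
    n_words = len(tokens)
    positive = negative = 0
    polar_position = 0.0
    for position, token in enumerate(tokens, start=1):
        polarity = lexicon.get(token, 0)
        if polarity > 0:
            positive += 1
        elif polarity < 0:
            negative += 1
        if polarity:
            # Earlier polar words get a larger weight: n_words / position.
            polar_position += polarity * n_words / position
    n_negation = sum(token in negation_words for token in tokens)
    n_question = sum(token.count(mark) for token in tokens
                     for mark in QUESTION_MARKS)
    n_wishful = sum(token in wish_words for token in tokens)
    return {
        "positive_count": positive,        # term frequency of positive words
        "negative_count": negative,        # term frequency of negative words
        "polar_position": polar_position,
        "n_words": n_words,
        "Is_negation": int(n_negation > 0),
        "N_O_Negation": n_negation,
        "Is_Question": int(n_question > 0),
        "N_O_Question": n_question,
        "Is_wishful": int(n_wishful > 0),
        "N_O_wishful": n_wishful,
    }
```

The resulting dictionary can be turned into a fixed-order numeric vector before it is passed to a classifier.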

In this study, an SVM with a linear kernel yielded the best performance.

The data (2000 topics), which consists of four types, was divided into 80% for
training, 10% for development and 10% for testing. Two experiments were run: the
first compared results with and without the expansion algorithm, and the second
used a new collection of data (400 topics), again before and after expansion. The
results show that lexicon expansion had a positive effect on sentiment
classification, with improvements ranging between 1-4% for all types of data.
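
The 80/10/10 division can be sketched as below. The summary does not say whether the data was shuffled or split separately for each of the four types, so the seeded shuffle and the function name are assumptions.

```python
import random

def split_corpus(items, seed=13):
    """Shuffle the items and split them 80% / 10% / 10%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train_set = shuffled[:n_train]
    dev_set = shuffled[n_train:n_train + n_dev]
    test_set = shuffled[n_train + n_dev:]
    return train_set, dev_set, test_set

# Integer IDs stand in for the 2000 annotated topics.
train_set, dev_set, test_set = split_corpus(range(2000))
```

With 2000 topics this gives 1600 for training and 200 each for development and testing.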

Some references:

1) Bootstrapping sentiment labels for unannotated documents with polarity
PageRank
2) A system for subjectivity and sentiment analysis of Arabic social media
3) Subjectivity and sentiment analysis of modern standard Arabic and Arabic
microblogs
4) Towards an optimal POS tag set for modern standard Arabic processing
5) Arabic dialect identification (O. F. Zaidan and C. Callison-Burch)
