
Approaches of different steps from ‘A comparative study of sentiment analysis approaches’

The text pre-processing stage is the most important part of sentiment analysis, because messages on social networks are characterized by colloquial expressions, abbreviations, emoticons, word lengthening, and irregular capitalization, and they generally do not conform to canonical grammatical rules.

Following this paper, we address the different preprocessing steps applied to the Arabic dialect in each referenced paper.

1) ‘Semantic Sentiment Analysis of Arabic Texts’: The main contribution of this paper is representing tweet texts in their semantic space, taking into account the semantic relationships between words by utilizing the Arabic WordNet (AWN) instead of using only bag-of-words features.
Preprocessing:

Adding tags: emoticon symbols and punctuation marks are replaced with corresponding meaningful word tags that represent their sentiment; for example, ‘:)’ is replaced by ‘happy’ (a minimal sketch of these steps follows this list).

Data cleaning: removing items that do not carry any sentiment, such as URLs, usernames, numbers, etc.

Normalization: stripping diacritics, stripping word lengthening, etc.

Stop words removal: a list of stop words, except for negations (https://code.google.com/p/stop-words/).
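
A minimal Python sketch of these preprocessing steps, assuming an illustrative emoticon map and simple regular expressions rather than the authors' exact resources:

import re

# Illustrative emoticon-to-tag map (assumed examples, not the paper's full list)
EMOTICON_TAGS = {":)": "happy", ":(": "sad", ":D": "happy", ":'(": "sad"}
# Arabic diacritics range (fathatan .. sukun)
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")

def preprocess(text):
    # Adding tags: replace emoticons with sentiment word tags
    for emo, tag in EMOTICON_TAGS.items():
        text = text.replace(emo, " " + tag + " ")
    # Data cleaning: drop URLs, usernames and numbers
    text = re.sub(r"(https?://\S+|@\w+|\d+)", " ", text)
    # Normalization: strip diacritics and word lengthening (3+ repeats -> 1)
    text = ARABIC_DIACRITICS.sub("", text)
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return re.sub(r"\s+", " ", text).strip()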

Concerning feature extraction, the paper adopted a concept-representation approach to capture the associations between words (via the AWN ontology) in two steps. First, concept identification: the words of each tweet are mapped to their concepts, and WordNet returns an ordered list from the most to the least appropriate; for words that have several concepts, the most appropriate sense is selected using the simplest WSD strategies used in previous works such as ‘WordNet improves text document clustering’, ‘Using WordNet for text categorization’ and ‘Sentiment analysis of twitter data using machine learning approaches and semantic analysis’. Second, concept incorporation: the semantic concepts extracted from the tweets are added as extra features to represent them.
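
A minimal sketch of the concept-incorporation step, where the concept lookup is a hypothetical stand-in for an AWN query that returns senses ordered from most to least appropriate:

def add_concept_features(tokens, concept_lookup):
    # concept_lookup: hypothetical mapping word -> ordered list of AWN senses
    concepts = []
    for tok in tokens:
        senses = concept_lookup.get(tok, [])
        if senses:                      # word is covered by the ontology
            concepts.append(senses[0])  # simplest WSD: keep the top-ranked sense
    return tokens + concepts            # BoW tokens plus concept features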

The bag-of-words model was also applied to the emoticon symbols, which were added as extra features.

The final dataset consisted of about 826 tweets covering different domains and was manually annotated. The experiments were conducted using 10-fold cross-validation in order to ensure reliable results.

The results showed that using concept features outperforms the baseline BoW model. As for the classifier, SVM performed better than NB. The highest F-measure reached 95% when using the AddC concept representation with the SVM classifier.
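
A sketch of this evaluation setup (10-fold cross-validation comparing NB and SVM) using scikit-learn; the texts and labels arguments are assumed to hold the preprocessed tweets and their manual annotations:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def evaluate(texts, labels):
    # compare NB and SVM under the same 10-fold cross-validation protocol
    for name, clf in [("NB", MultinomialNB()), ("SVM", LinearSVC())]:
        pipe = make_pipeline(CountVectorizer(), clf)
        scores = cross_val_score(pipe, texts, labels, cv=10, scoring="f1_macro")
        print(name, scores.mean())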


2) ‘Sentiment Analysis of Moroccan Tweets using Naive Bayes Algorithm’: This paper aims to classify Moroccan tweets by means of NB, use topic modeling (LDA) to discover the topics of the classified tweets, and finally locate these tweets on a map of Morocco according to their categories using the ‘Folium’ tool.

Preprocessing:

Assuming that the emotion symbols in a tweet represent the overall sentiment of that tweet, the tweets are filtered using a list of positive and negative emotion symbols.

A dictionary of words, gathered manually, is created to transform words written in the Moroccan dialect or in a Berber dialect into MSA. The words written in the Arabic or French alphabet are stored on each slave node of the cluster.

The preprocessing steps (a cleaning sketch is given after this list):

Delete unnecessary data: usernames, emails, hyperlinks, retweets, punctuation, possessives from nouns, duplicate characters, and special characters such as smileys

Shorten any elongated words

Normalize whitespace

Convert hashtags into separate words

Create a function to detect the language used to write the text of a tweet

Create a function for automatic correction of spelling mistakes


Create a list of contractions to normalize and expand words

Delete the suffixes of a word until the root is found

Remove tokens whose part of speech is not important for the analysis, using part-of-speech tagging software

Remove stop words of MSA
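
A minimal sketch of some of the cleaning steps above (deleting usernames, emails, hyperlinks and retweets, converting hashtags, shortening elongated words, and normalizing whitespace); the regular expressions are assumptions, not the author's code:

import re

def clean_tweet(text):
    # delete retweet markers, usernames, hyperlinks and emails
    text = re.sub(r"\bRT\b|@\w+|https?://\S+|\S+@\S+", " ", text)
    # convert hashtags into plain words (full word segmentation is omitted here)
    text = re.sub(r"#(\w+)", r"\1", text)
    # shorten elongated words (3+ repeated characters -> 1)
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # normalize whitespace
    return re.sub(r"\s+", " ", text).strip()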

During the experiment, the author collected a sample of 700 tweets for training and 300 tweets for testing. He then applied NB as a classifier; afterwards, LDA was applied to the classified tweets of each category.

This approach reached an accuracy of 69%, which is considered a good value in this case. However, the author could have increased the size of the training set to improve the model.
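
A sketch of the topic-discovery stage, fitting LDA on the tweets of each predicted category with scikit-learn; tweets_by_category is an assumed mapping from NB-predicted class to its tweets, and the author's exact tooling may differ:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topics_per_category(tweets_by_category, n_topics=5):
    results = {}
    for category, tweets in tweets_by_category.items():
        vec = CountVectorizer()
        dtm = vec.fit_transform(tweets)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        lda.fit(dtm)
        vocab = vec.get_feature_names_out()
        # keep the top 10 words of each discovered topic
        results[category] = [[vocab[i] for i in comp.argsort()[-10:][::-1]]
                             for comp in lda.components_]
    return results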

3) ‘Semantic sentiment analysis in Arabic social media’: The main contribution of this paper is proposing an Arabic sentiment ontology for the Jordanian Arabic dialect, which focuses on the semantic relations between sentiments and their instances and contains two groups of positive and negative words.
In order to compute the sentiment of any tweet, the weights of its words are calculated and summed up (a scoring sketch follows). Two fragments of these ontologies are shown as examples in the paper.
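
A minimal sketch of this scoring rule, summing the ontology weights of the words in a tweet; the weight table and the neutral fallback at zero are illustrative assumptions:

def tweet_sentiment(tokens, weights):
    # weights: word -> signed weight from the sentiment ontology (assumed format)
    score = sum(weights.get(tok, 0.0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # assumed fallback when positive and negative weights cancel out

# Toy usage: tweet_sentiment(["service", "excellent"], {"excellent": 1.0, "bad": -1.0}) -> "positive"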

To test the reliability of this model, 1100 tweets were collected from three organizations in Jordan and were manually annotated. The paper then compares the manual and automatic classification results.
4) ‘Negation handling in sentiment analysis at sentence level’: This paper addresses the problem of negation, which affects the polarities of other words. When a negation appears in a sentence, it is important to determine the sequence of words affected by the negation term. The main difficulty in handling negation is its scope, which may be limited to the next word or may extend to several words following the negation. Negation may appear in two forms, i.e., morphological and syntactic negation.
Scope of syntactic negations:

In order to determine the negation scope of syntactic negations, the authors propose a method that uses three linguistic features with a static window and considers some exceptions. The linguistic features are conjunction analysis, punctuation marks, and heuristics based on the POS of the negation term.

Scope of diminishers and morphological negations:

Diminisher negations differ from syntactic negations because they usually reduce the polarities of other words instead of completely inverting them. In order to determine the scope of diminishers, the method uses a list of diminishers (hardly, less, little, etc.) to detect the presence of this type of negation in a sentence. The affected adjective or verb is then determined using two heuristics: 1) an adverb is usually used immediately before or after the adjective or verb it modifies; 2) if neither an adjective nor a verb appears immediately before or after the diminisher, it is likely to affect the nearby adjective or verb within the same clause. After determining the affected adjective or verb, the polarity of this word is diminished using a reducing factor of 0.2 (a sketch follows).
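
A sketch of these two heuristics, assuming tokens that carry a simplified POS tag and a prior polarity score; the reduction is interpreted here as multiplication by the 0.2 factor, and clause boundaries are ignored for brevity:

DIMINISHERS = {"hardly", "less", "little"}
REDUCING_FACTOR = 0.2

def apply_diminishers(tokens):
    # tokens: list of dicts like {"word": "good", "pos": "ADJ", "score": 1.0}
    for i, tok in enumerate(tokens):
        if tok["word"].lower() not in DIMINISHERS:
            continue
        # heuristic 1: adjective/verb immediately before or after the diminisher
        targets = [j for j in (i + 1, i - 1)
                   if 0 <= j < len(tokens) and tokens[j]["pos"] in ("ADJ", "VERB")]
        if not targets:
            # heuristic 2: otherwise the nearest adjective/verb in the sentence
            candidates = [j for j, t in enumerate(tokens) if t["pos"] in ("ADJ", "VERB")]
            if candidates:
                targets = [min(candidates, key=lambda j: abs(j - i))]
        if targets:
            tokens[targets[0]]["score"] *= REDUCING_FACTOR  # diminish the affected word
    return tokens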

In some cases, the negation term and the negated opinionated word are combined into a single word, e.g., in words such as endless, impolite, dishonest, and non-cooperative. This type of negation is called morphological negation and can be formed by attaching one of nine prefixes (de-, dis-, il-, im-, in-, ir-, mis-, non-, un-) or the suffix -less to a root word (a detection sketch follows this paragraph). The improvement in performance of the proposed method is due to three main reasons. First, it determines the scope of the different types of negation more effectively. Second, an appropriate word sense disambiguation method is adopted. Finally, all opinionated POS (i.e., adjective, verb, adverb and noun) are considered when determining the polarity.
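
A minimal sketch of morphological negation detection: a word is flagged when one of the nine prefixes or the -less suffix is attached to a known root; known_roots is an assumed opinion-word lexicon:

NEG_PREFIXES = ("de", "dis", "il", "im", "in", "ir", "mis", "non", "un")

def morphological_negation(word, known_roots):
    w = word.lower().replace("-", "")
    if w.endswith("less") and w[:-4] in known_roots:
        return w[:-4]                 # e.g. "endless" -> "end"
    for prefix in NEG_PREFIXES:
        if w.startswith(prefix) and w[len(prefix):] in known_roots:
            return w[len(prefix):]    # e.g. "impolite" -> "polite"
    return None                       # no morphological negation detected

# Usage: morphological_negation("impolite", {"polite", "honest", "end"}) -> "polite"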

According to the experiments, the sentiment analysis performs better when using the light stemmer, emoticon analysis, negation words, and intensifiers, with accuracy improvements ranging from 9 to 25 points.

Some references:

- ‘A Negation Handling Technique for Sentiment Analysis’
- ‘Negation Identification and Calculation in Sentiment Analysis’
- ‘Negation Handling in Sentiment Analysis at Sentence Level’

5) ‘Performance Evaluation of an Adopted Sentiment Analysis Model for Arabic Comments from the Facebook’

In this work, the author used a dataset of 1742 comments (340 in MSA and 1402 in colloquial Arabic) and ran several experiments using the preprocessing methods below, to evaluate the performance of these methods either individually or combined. The performance of the sentiment analysis model improved when adopting these techniques. The best performance was achieved when using negations and intensifiers (98.2% for the predicted positive class and 93.2% for the predicted negative class). Finally, the best performance was obtained for comments written in MSA, while the worst was for those using the informal Arabic style.

Preprocessing

Removal of stop words using a list of 607 stop words


Normalization is used for removing or replacing a set of letters or non-letters, such as elongations, numbers, etc.
Light stemming consists of removing both the Arabic prefixes and the Arabic suffixes

Negation handling: if a word is found in the negation list, the polarity of the neighboring opinion word is computed by multiplying the score of the opinion word by -1 (see the sketch after this list). Another method has proven to be effective in [12, 8]. In this work, only syntactic negations are handled, using the part of speech and the dependency tree in the polarity calculation.

Checking of intensifiers: amplifiers and downtoners increase or decrease the sentiment value by 50% and -50% respectively, regardless of the degree of strength or weakness of the intensifier item ([12, 8])

Example of intensifiers: a sentence boundary detection system that uses punctuation marks such as ‘?’, for example ‘is it true?????’

Identification of emoticons: using a list of positive and negative emoticons
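
A minimal sketch of the negation and intensifier rules above; the word lists are small illustrative assumptions, not the paper's lexicons:

# Assumed example word lists (not the paper's resources)
NEGATIONS = {"لا", "ليس", "ما"}      # Arabic negation particles
AMPLIFIERS = {"جدا"}                  # e.g. "very"
DOWNTONERS = {"قليلا"}                # e.g. "slightly"

def adjust_polarity(prev_word, opinion_score):
    if prev_word in NEGATIONS:
        return opinion_score * -1     # negation inverts the polarity
    if prev_word in AMPLIFIERS:
        return opinion_score * 1.5    # amplifier: +50%
    if prev_word in DOWNTONERS:
        return opinion_score * 0.5    # downtoner: -50%
    return opinion_score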

Some important references:

- Salha Al-Osaimi and Muhammad Badruddin, ‘Sentiment Analysis Challenges of Informal Arabic Language’
- Muhammad Zubair Asghar, Aurangzeb Khan, Shakeel Ahmad, Maria Qasim, and Imran Ali Khan, ‘Lexicon-enhanced Sentiment Analysis Framework using Rule-Based Classification Scheme’
- Anna Jurek, Maurice D. Mulvenna and Yaxin Bi, ‘Improved Lexicon-based Sentiment Analysis for Social Media Analytics’
- Amira Shoukry and Ahmed Rafea, ‘Preprocessing Egyptian Dialect Tweets for Sentiment Mining’
- Ahmed Y. Al-Obaidi and Venus W. Samawi, ‘Opinion Mining: Analysis of Comments Written in Arabic Colloquial’
- S. R. El-Beltagy, ‘NileULex: A Phrase and Word Level Sentiment Lexicon for Egyptian and Modern Standard Arabic’
6) ‘Social Networks’ Text Mining for Sentiment Classification: The case of Facebook statuses updates in the “Arabic Spring” Era’

The architecture proposed in this article includes six steps: raw data collection, lexicon development, data preprocessing, feature extraction, and sentiment classification.

Raw collection: a novel collection representing Facebook status updates of Tunisian users (gathered through a web application called ‘I Told You’)

Lexicon development: three types of lexicons were created:

- Acronyms’ lexicon: e.g. LOL, which is labeled as positive
- Emoticons’ lexicon: e.g. positive and negative emoticon symbols
- Interjections’ lexicon: e.g. Wow and no way

Preprocessing:

-Removing stop words

-Stemming

Feature extraction:

-Bag of words and vector space model

-N-gram

-Part of speech

After applying the two models with different combinations of feature extractors and using 10-fold cross-validation for evaluation, it turns out that the SVM model outperformed NB. In addition, most of the statuses published on Facebook during the revolution have a positive sentiment.

Some references:

- ‘Developing resources for sentiment analysis of informal Arabic text in social media’

7) ‘Sentiment analysis for dialectical Arabic’

This article addresses sentiment analysis of MSA and Jordanian dialectal Arabic tweets; its main contribution consists of creating a lexicon from the tweets by translating all the dialectal words into MSA using a crowdsourcing tool.

The dataset was annotated with three labels: positive, negative, and neutral.

An example of dialectal words and their corresponding MSA words is given in the paper.

Regarding tweet preprocessing, the author used:

Tokenizing

Stop words removal except negation

Converting emoticons to their corresponding words by using a specialized mapping table (an example is given in the paper)

Khoja stemmer

Determining the weight of every token using the binary model, where a token is given a weight of 1 if it is present in the tweet under consideration and a weight of 0 if it is absent (see the sketch below).
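
A minimal sketch of this binary weighting scheme; with scikit-learn the same effect can be obtained with CountVectorizer(binary=True), and the manual version below is shown for clarity:

def binary_weights(tweet_tokens, vocabulary):
    # weight 1 if the vocabulary term appears in the tweet, 0 otherwise
    present = set(tweet_tokens)
    return [1 if term in present else 0 for term in vocabulary]

# Usage: binary_weights(["good", "service"], ["good", "bad"]) -> [1, 0]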

Two classification models, NB and SVM, were applied to a dataset of 22,550 tweets, using two versions of the dataset: one without replacement of the dialectal words, and a second in which the dialectal words were replaced with their corresponding MSA words.

1) Results without the dialect lexicon

2) Results with the dialect lexicon


The results reveal that replacing the dialectal words slightly increases the performance of the models, although the neutral class is not improved.

Some references:
- https://www.ijaiem.org/Volume2Issue5/IJAIEM-2013-05-26-063.pdf

- ‘Social Networks’ Text Mining for Sentiment Classification: The case of Facebook statuses updates in the “Arabic Spring” Era’: this paper mainly addresses the usage of an emoticons’ lexicon, an interjections’ lexicon, and part-of-speech tagging as features in the preprocessing step.

- https://www.researchgate.net/publication/233859560_Preprocessing_Egyptian_Dialect_Tweets_for_Sentiment_Mining (for the preprocessing steps)
