Studies in Computational Intelligence 740
Khaled Shaalan
Aboul Ella Hassanien
Fahmy Tolba Editors
Intelligent
Natural Language
Processing: Trends
and Applications
Studies in Computational Intelligence
Volume 740
Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: kacprzyk@ibspan.waw.pl
The series “Studies in Computational Intelligence” (SCI) publishes new develop-
ments and advances in the various areas of computational intelligence—quickly and
with a high quality. The intent is to cover the theory, applications, and design
methods of computational intelligence, as embedded in the fields of engineering,
computer science, physics and life sciences, as well as the methodologies behind
them. The series contains monographs, lecture notes and edited volumes in
computational intelligence spanning the areas of neural networks, connectionist
systems, genetic algorithms, evolutionary computation, artificial intelligence,
cellular automata, self-organizing systems, soft computing, fuzzy systems, and
hybrid intelligent systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
Editors

Khaled Shaalan
The British University in Dubai
Dubai, United Arab Emirates

Aboul Ella Hassanien
Faculty of Computers and Information Technology
Cairo University
Giza, Egypt

Fahmy Tolba
Faculty of Computers and Information
Ain Shams University
Cairo, Egypt
Preface
that contributed to the staging of this edited book (authors, committees, and
organizers). We deeply appreciate their involvement and support that was crucial
for the success of the “Intelligent Natural Language Processing: Trends and
Applications” edited book.
Contents
Part XI E-learning
Intelligent Text Processing to Help Readers with Autism . . . . . . . . . . . . 713
Constantin Orăsan, Richard Evans and Ruslan Mitkov
Education and Knowledge Based Augmented Reality (AR) . . . . . . . . . . . 741
Salwa Hamada
A Tutorial on Information Retrieval Using Query Expansion . . . . . . . . . 761
Mohamed Yehia Dahab, Sara Alnofaie and Mahmoud Kamel
Part I
Sentiment Analysis

Using Deep Neural Networks for Extracting Sentiment Targets in Arabic Tweets
A. El-Kilany ⋅ A. Azzam
Faculty of Computers and Information, Cairo University, Giza 12613, Egypt
e-mail: a.elkilany@fci-cu.edu.eg
A. Azzam
e-mail: a.tarek@fci-cu.edu.eg
S.R. El-Beltagy (✉)
Center for Informatics Science, Nile University, Giza 12588, Egypt
e-mail: samhaa@computer.org

1 Introduction
Kentucky lost to Tennessee!” there are two entities: ‘Kentucky’ and ‘Tennessee’.
The overall sentiment of this tweet is positive and as such the sentiment target
recognizer must identify the target of this positive sentiment which is “Tennessee”.
Sentiment target recognition in this setting is a specific case of a more general
framework which assumes that each entity in the text may have its own sentiment.
In this paper, we propose a solution that finds the targets of the overall text sentiment, rather than breaking the text down into multiple sentiments, each with its own targets. Such a setting allows us to tackle the sentiment target recognition problem as a specific case of named-entity recognition.
In this paper, we propose two models for the sentiment target recognition task. All our experiments are carried out on Arabic text. The first model is a baseline that utilizes a named entity recognizer [1, 2] and a sentiment analyzer [3] to recognize the target entities of a given text's sentiment. The second model is a deep neural network for named entity recognition which is provided with target named entities in the training phase in order to identify them during the extraction phase.
The deep neural network that we used consists of a bidirectional LSTM network [4] with a conditional random field [5] classifier as a layer on top of it. A word2vec model for Arabic words was built in order to represent the word vectors in the network. The deep neural network model has shown a noticeable advantage over the baseline model in detecting sentiment targets over a dataset of Arabic tweets. Figures 1 and 2 depict the components of the baseline and deep neural network models.
The rest of this paper is organized as follows: Sect. 2 provides a background
with respect to tasks that are related to sentiment target recognition; Sect. 3
describes the collection and annotation of Arabic tweets; Sect. 4 details the process
of building the word embeddings model while Sect. 5 describes both the baseline
model and the deep neural model; Sect. 6 presents the evaluation of the proposed
model through comparison of its performance against the baseline model while
Sect. 7 concludes this paper.
2 Background
tweets will be ambiguous, while others will have wrong entities. The step that followed the collection of this dataset was revising and correcting the automatically assigned labels, as well as specifying the sentiment target for each tweet. To facilitate this process, a web-based, user-friendly application was developed. In cases where the assigned sentiment was wrong or the detected entity was incorrect, the annotator was asked to correct it. The annotator was also asked to specify the target entities for each tweet, either by selecting from already identified entities or by manually entering the target entity.
The annotation process was carried out by a paid expert annotator. To ensure
that the resulting annotations would be of high quality, we periodically collected
random annotated samples and examined their correctness. The annotator was
aware that we were doing this and the agreement was that s/he will only get paid
after the annotation process was completed. This process resulted in the annotation
of over 5000 target entities.
We created a set of tags similar to those defined by the IOB format [8], which is used for tagging sentiment targets and non-targets. The sentiment target recognition tagging scheme has only three tags: B-target, I-target, and O-tag. A word is tagged with B-target if it is the first word (beginning) of a target entity in a tweet. The I-target tag is used when the target is made up of a sequence of words: while the first word in such a case is tagged with B-target, the remaining words in the sequence are tagged with I-target. Any word which is not a target is tagged with the O-tag. This dataset is available by request from any of the authors and will also be made available online.
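As an illustration of this scheme, a minimal Python sketch (not from the paper) that converts a tokenized tweet plus known target spans into B-target/I-target/O tags might look like:

```python
def tag_targets(tokens, target_spans):
    """Assign IOB-style tags: B-target for the first word of a target,
    I-target for subsequent target words, O-tag for everything else.
    `target_spans` is a list of (start, end) token-index pairs, end exclusive."""
    tags = ["O-tag"] * len(tokens)
    for start, end in target_spans:
        tags[start] = "B-target"
        for i in range(start + 1, end):
            tags[i] = "I-target"
    return tags

# A multi-word target spanning tokens 2-3 of a five-token tweet:
print(tag_targets(["w0", "w1", "w2", "w3", "w4"], [(2, 4)]))
# → ['O-tag', 'O-tag', 'B-target', 'I-target', 'O-tag']
```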
One of the goals of this work was to experiment with a bidirectional Long
Short-Term Memory (BI-LSTM) neural network. Previous work for text related
tasks has shown that the performance of such networks is usually enhanced by the
use of word embeddings. Word embeddings [16] refer to a feature-learning model used in NLP tasks to transform input words into a clustered representation of real-valued vectors. The model results in the construction of low-dimensional
vector representations for all words in a given input text corpus. Words sharing
common contexts in that corpus are transformed into vectors that are close to each
other in the vector space. As such, word embeddings produce an expressive representation for words, as they capture their semantics.
Using word embeddings within deep learning models for NLP tasks such as semantic parsing, named entity recognition, and sentiment analysis has been shown to result in a great performance boost. These tasks usually use a limited set of
labeled instances for training purposes. Word embeddings are usually constructed
from huge corpora, so it is very beneficial in these tasks to make use of pre-trained
word embeddings in order to have a generalized model with a reliable estimation
for the problem parameters.
For the English language, it is quite easy for researchers wishing to experiment with word embeddings to use publicly available pre-trained embeddings. For Arabic, however, such embeddings did not exist, so we had to build our own. Training a word embeddings model usually requires a large input corpus. The corpus that we used consisted of fifty-four million Arabic tweets collected over a duration of four months (during 2015) by researchers at Nile University. In this work, we applied two word embedding generation models to generate a vector for each corpus word.
The first model was built using word2vec [16], trained with the skip-gram model. Building the model involved a series of experiments with different settings and configurations. The tool used for building the word2vec model was Gensim,1 a widely used Python library for topic modeling and other text-related tasks. In the end, we used a model set to produce vectors of length 100, with a window size of 5 and a minimum word-frequency cut-off of 5. The advantage of this method is that it preserves the contextual similarity of words. Moreover, word2vec learns word order from the context of the large corpus of tweets.
The second embeddings model uses character-level embeddings, deriving a representation for every character in order to create word embeddings. The model that we used is similar to the one presented in [8]. The character embeddings are randomly initialized; each word in the sentiment target recognition dataset is then passed, character by character, through a bidirectional LSTM in the forward and backward directions. The embedding of each word is generated by concatenating the bidirectional LSTM representations of the word's characters. The advantage of the character embeddings model is that it allows building task-specific embeddings generated from the labeled training data. In addition, it is capable of handling the out-of-vocabulary problem.
The goal of our work was to experiment with a bidirectional Long Short-Term Memory (BI-LSTM) neural network for addressing the task of sentiment target recognition. However, since this task had not been addressed before for Arabic, we needed to provide a baseline model against which we could compare the obtained results. So in this paper, we present two different models for sentiment target recognition in Arabic tweets. The first model is the baseline, which relies on a windowing method to detect target entities. The second model presents our proposed approach, a sequence tagging model based on a hybrid approach
1 https://pypi.python.org/pypi/gensim.
Fig. 3 Algorithm for the Window-based method for detecting sentiment targets
Recurrent Neural Networks (RNNs) [7] are a class of neural networks with the ability to preserve historical information from previous states through loops or iterations. These loops allow an RNN to represent information about sequences: the RNN takes a sequence of vectors (x1, x2, …, xt) as input and produces another sequence (h1, h2, …, ht) as output that represents information about the input sequence over time t.
RNNs have been applied to a variety of tasks including speech recognition [17] and image captioning [18]. RNNs have also been used for NLP tasks such as language modeling, named entity recognition (NER) [8], and machine translation [19]. However, RNNs suffer from the problem of long-term dependencies: they struggle to remember information for long periods of time, and in practice an RNN is biased toward recent input vectors in an input sequence.
An LSTM [6] is a special kind of RNN designed to solve the long-term dependency problem through several gates, including the forget gates. Adding these gates gives an LSTM the ability to control the flow of information into the cell state. The forget gates prevent the back-propagated errors from vanishing or exploding through a sigmoid layer that determines which information to remember and which to forget. In this research, we used the following implementation of the LSTM:

$h_t = o_t \odot \tanh(c_t)$    (4)
current word t, while the second is responsible for providing the representation of the right context by passing over the tweet in the reverse direction. Generally, the first is called the forward LSTM and the latter the backward LSTM. Each word in a bidirectional LSTM is represented by concatenating the output of the forward and backward LSTMs. The vector resulting from the bidirectional LSTM provides an effective representation for each word, as it takes context into consideration.
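The forward/backward concatenation can be illustrated with a deliberately tiny single-unit recurrent pass in plain Python. This is a toy sketch, not the paper's network: a real BI-LSTM uses gated cells and learned weight matrices, and the weights below are arbitrary.

```python
import math

def rnn_pass(xs, w_x=0.5, w_h=0.3):
    # Minimal single-unit recurrent pass: h_t = tanh(w_x * x_t + w_h * h_{t-1})
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional(xs):
    fwd = rnn_pass(xs)              # left-to-right: encodes each word's left context
    bwd = rnn_pass(xs[::-1])[::-1]  # right-to-left, re-aligned: right context
    # Each position is represented by the concatenation [forward, backward].
    return [[f, b] for f, b in zip(fwd, bwd)]

reps = bidirectional([1.0, -1.0, 0.5])  # one 2-element representation per position
```

The key property is that each position's representation depends on both what precedes it (forward state) and what follows it (backward state).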
5.2.3 BI-LSTM-CRF
$S(X, y) = \sum_{i=0}^{N} C_{y_i,\, y_{i+1}} + \sum_{i=1}^{N} L_{i,\, y_i}$    (5)
A softmax function is applied to produce the probability of the tag sequence y:

$P(y \mid X) = \dfrac{e^{S(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{S(X, \tilde{y})}}$    (6)
With $Y_X$ representing the set of all possible tag sequence predictions for a sentence X, the objective function for the sentiment target recognition model during training aims to maximize the log-probability of the correct tag sequence as follows:

$\log(P(y \mid X)) = S(X, y) - \operatorname{logadd}_{\tilde{y} \in Y_X} S(X, \tilde{y})$    (7)
During prediction, the output sequence is chosen from the set of all possible sequences as the one that gives the maximum score:

$y_{\text{predicted}} = \operatorname{argmax}_{\tilde{y} \in Y_X} S(X, \tilde{y})$    (8)
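Equations (5) and (8) can be made concrete with a brute-force sketch over the three-tag scheme. This is illustrative only: the emission and transition scores are hypothetical, boundary tags are omitted, and a real CRF layer computes the argmax with dynamic programming (the Viterbi algorithm) rather than enumerating all of $Y_X$.

```python
import itertools

TAGS = ["B-target", "I-target", "O-tag"]

def score(emissions, transitions, tags):
    # S(X, y): per-word emission scores L plus tag-transition scores C (Eq. 5)
    s = sum(emissions[i][t] for i, t in enumerate(tags))
    s += sum(transitions[(a, b)] for a, b in zip(tags, tags[1:]))
    return s

def predict(emissions, transitions):
    # argmax over all possible tag sequences (Eq. 8)
    n = len(emissions)
    return max(itertools.product(TAGS, repeat=n),
               key=lambda y: score(emissions, transitions, y))

# Hypothetical scores for a two-word input:
emissions = [{"B-target": 2.0, "I-target": 0.0, "O-tag": 1.0},
             {"B-target": 0.0, "I-target": 1.5, "O-tag": 0.5}]
transitions = {(a, b): (0.5 if (a, b) == ("B-target", "I-target") else 0.0)
               for a in TAGS for b in TAGS}
print(predict(emissions, transitions))  # → ('B-target', 'I-target')
```

The transition term is what lets the CRF layer learn, for example, that I-target may only follow B-target or I-target.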
6 Performance Evaluation
The objective of our evaluation was to test our hypothesis that using a deep neural
network to discover a set of features for a challenging task such as sentiment target
recognition in a complex language like Arabic, can outperform a baseline solution
for the task.
Towards this goal, training and testing sets were derived from the annotated Arabic tweets dataset previously described in the data collection and annotation section. From those, only 3000 tweets were selected: two thousand tweets were used for the training set, while the remaining 1000 were used for testing. Tweets in both the training and testing sets were cleaned by removing all special characters. They were also normalized as follows: the characters "أ", "إ", and "آ" were replaced with "ا"; "ى" was replaced by "ي"; and "ة" was replaced by "ه".
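This cleaning and normalization step can be written directly in Python; the following is a sketch consistent with the rules above, not the authors' code.

```python
import re

# Character replacements described above: alef variants → ا, ى → ي, ة → ه
NORMALIZE = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ى": "ي", "ة": "ه"})

def clean_tweet(text):
    text = re.sub(r"[^\w\s]", " ", text)  # remove special characters
    return text.translate(NORMALIZE).strip()
```

Python 3's `\w` matches Arabic letters by default, so only punctuation and symbols are stripped.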
The sentiment target recognition tagging network, a bidirectional LSTM augmented with a CRF layer as described in the previous section, was built by adapting the implementation and network structure made available by the authors of [8] for tagging named entities. Given the relatively small size of our training set, the network was trained for 100 epochs in order to set the neuron weights.
To analyze the effect of the word2vec vectors on the results, the bidirectional LSTM-CRF network was trained once with the word vectors from the word2vec model concatenated with word vectors from the character embeddings model, and once with the word vectors from the character embeddings model only. About 94% of words in the training dataset had a vector representation in our set of word2vec embeddings.
Both our model and the previously described baseline method aim at recognizing sentiment targets in each Arabic tweet in the test set. The precision, recall, and F-score metrics were used to compare the models. An annotated sentiment target is considered to be detected by the model if any of its constituent words is detected. Table 1 shows the scores for the baseline method, the bidirectional LSTM-CRF with character embeddings only, and the bidirectional LSTM-CRF with character embeddings and word2vec embeddings.
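The lenient matching criterion and the resulting scores can be sketched as follows (illustrative; the paper does not give its evaluation script):

```python
def detected(gold_target_words, predicted_words):
    # A gold target counts as detected if ANY of its constituent words is predicted.
    return any(w in predicted_words for w in gold_target_words)

def precision_recall_f(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# A two-word gold target is detected even if only one of its words is found:
print(detected(["prime", "minister"], {"minister"}))  # → True
```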
The results in Table 1 show the clear superiority of the bidirectional LSTM-CRF
network over the baseline method with respect to all metrics, especially when using
word embeddings. Handcrafting a set of features for sentiment target recognition is a
challenging task especially since most targets are simply a special case of a named
entity, but through training the bidirectional LSTM network, the network was able
to derive features that distinguish between sentiment target entities and other entities.
The results also show that both bidirectional LSTM-CRF networks (with and without word embeddings) achieved comparable precision, but the network with the word2vec model achieved a much higher recall. This means that the bidirectional LSTM network with the word2vec model discovered more target entities in the tweets while maintaining the same extraction precision. The improvement achieved by the bidirectional LSTM-CRF network with the word2vec model emphasizes the power of word embeddings for improving the performance of NLP tasks.
7 Conclusion
This paper presented a model for detecting sentiment targets in Arabic tweets. The
main component of the model is a bidirectional LSTM deep neural network with a
conditional random fields (CRF) classification layer, and a word embeddings layer.
Results obtained by experimenting with the proposed model, show that the model is
capable of addressing the sentiment target recognition problem despite its inherent
challenges. The experiments also revealed that the use of a word2vec model
enhances the model’s results and contributes to the overall effectiveness of the
network in extracting sentiment targets. Comparing the proposed model against a
baseline method showed a considerable advantage in terms of results.
Acknowledgements We would like to thank Abu Bakr Soliman for building the user interface
used for annotating sentiment target data and for collecting the 54 million tweets that we have used
for building our word embeddings model. This work was partially supported by ITIDA grant
number [PRP]2015.R19.9.
References
1. Zayed, O., El-Beltagy, S.R.: Named entity recognition of persons’ names in Arabic tweets. In:
Proceedings of Recent Advances in Natural Language Processing (RANLP 2015), Hissar,
Bulgaria (2015)
2. Zayed, O., El-Beltagy, S.R.: A hybrid approach for extracting Arabic persons’ names and
resolving their ambiguity from twitter. In: Métais, E. et al. (eds.) Proceedings of 19th
International Conference on Application of Natural Language to Information Systems
(NLDB2015), NLDB 2015, Lecture Notes in Computer Science (LNCS), Passau, Germany
(2015)
3. El-Beltagy, S.R., Khalil, T., Halaby, A., Hammad, M.H.: Combining lexical features and a
supervised learning approach for Arabic sentiment analysis. In: CICLing 2016, Konya,
Turkey (2016)
4. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv:
1508.01991 (2015)
5. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for
segmenting and labeling sequence data. In: Proceedings of the Eighteenth International
Conference on Machine Learning, ICML, pp. 282–289 (2001)
6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780
(1997)
7. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. Trans. Sig. Proc. 45,
2673–2681 (1997)
8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures
for named entity recognition. In: Proceedings of NAACL-HLT (NAACL 2016) (2016)
9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of
words and phrases and their compositionality. In: Proceedings of the 26th International
Conference on Neural Information Processing Systems, pp. 3111–3119. Curran Associates
Inc., USA (2013)
10. Pontiki, M., Galanis, D., Papageorgiou, H., Androutsopoulos, I., Manandhar, S., AL-Smadi,
M., Al-Ayyoub, M., Zhao, Y., Qin, B., De Clercq, O., Hoste, V., Apidianaki, M., Tannier, X.,
Loukachevitch, N., Kotelnikov, E., Bel, N., Jiménez-Zafra, S.M., Eryiğit, G.:
SemEval-2016 task 5: aspect based sentiment analysis. In: Proceedings of the 10th
International Workshop on Semantic Evaluation (SemEval-2016), pp. 19–30. Association for
Computational Linguistics (2016)
11. Khalil, T., El-Beltagy, S.R.: NileTMRG: deep convolutional neural networks for aspect
category and sentiment extraction in SemEval-2016 Task 5. In: Proceedings of the 10th
International Workshop on Semantic Evaluation (SemEval 2016), San Diego, California,
USA (2016)
12. Mitchell, M., Aguilar, J., Wilson, T., Durme, B. Van: Open domain targeted sentiment. In:
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
pp. 1643–1654 (2013)
13. Zhang, M., Zhang, Y., Vo, D.-T.: Gated neural networks for targeted sentiment analysis. In:
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona,
USA. Association for the Advancement of Artificial Intelligence (2016)
14. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural
language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
15. El-Beltagy, S.R.: NileULex: a phrase and word level sentiment Lexicon for Egyptian and
modern standard Arabic. In: Proceedings of LREC 2016, Portorož, Slovenia (2016)
16. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in
vector space. ICLR Work. (2013)
17. Sak, H., Senior, A.W., Beaufays, F.: Long short-term memory based recurrent neural network
architectures for large vocabulary speech recognition. CoRR. arXiv:1402.1128 (2014)
18. Mao, J., Xu, W., Yang, Y., Wang, J., Yuille, A.L.: Deep captioning with multimodal recurrent
neural networks (m-RNN). CoRR. arXiv:1412.6632 (2014)
19. Zhang, B., Xiong, D., Su, J.: Recurrent neural machine translation. CoRR. arXiv:1607.08725
(2016)
Evaluation and Enrichment of Arabic Sentiment Analysis
Abstract This paper presents an approach that explores the role of lexicalization for Arabic sentiment analysis. Sentiment analysis in Arabic is hindered by a lack of resources, a mismatch between the language in use and existing sentiment lexicons, the assumption that datasets must be pre-processed, and, as a major concern, the repeated application of the same approaches. Our solutions to these problems include, first, extending the lexicon to include more words not restricted to Modern Standard Arabic; second, avoiding pre-processing of the dataset; and third, and most importantly, developing an Arabic sentiment analysis system using a novel rule-based approach. This approach applies heuristic rules, appropriately positioned and prioritized in a chain for each lexicon, in a manner that accurately classifies tweets as positive or negative; the resulting series of abstractions yields an end-to-end rule-chaining mechanism. The rules were expensive in time and effort but produced outstanding results: the end-to-end rule-chaining approach achieved 93.9% accuracy when tested on the baseline dataset and 85.6% accuracy on OCA, the second dataset. A further comparison with the baseline showed a large increase in accuracy of 23.85%.
1 Introduction
The task of sentiment analysis is to identify content that carries opinions and to organize it according to its polarity: negative, positive, or neutral. Organizations gain considerable prestige by living up to the opinions of different people [1, 2]. Subjectivity and sentiment analysis classification is organized along four dimensions: (1) subjectivity classification, to decide whether text is subjective or objective; (2) sentiment analysis, to predict the polarity, which may be negative, positive, or neutral; (3) the level of analysis: document, sentence, word, or phrase; and (4) the methodology that is followed, which may be rule-based, machine learning, or hybrid [3]. As stated by Liu [4], "Sentiment analysis, also called opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes."
Arabic is a Semitic language that is distinctive in its history, structure, diglossic nature, and complexity (Farghaly and Shaalan [5]). Arabic is widely spoken by more than 300 million people. Arabic Natural Language Processing (NLP) is challenging, and Arabic sentiment analysis is no exception. Arabic is highly inflectional [6, 7] because of its affixes, which include prepositions and pronouns. Arabic morphology is intricate: nouns and verbs derive from some 10,000 roots [8], and Arabic morphology has 120 patterns. Beesley [8] highlighted the significance of 5,000 roots for Arabic morphology. The absence of capitalization makes Arabic named entity recognition a difficult task [9]. The free word order of the Arabic language brings additional challenges for sentiment analysis, as the words in a sentence can be swapped without changing the structure or the meaning. Arabic sentiment analysis has accordingly been a major focus for researchers [10].
The objective of this study is to examine techniques that determine the negative and positive polarity of the input text. One of the key outcomes is a comparison of the proposed end-to-end rule-chaining approach against other lexicon-based and machine-learning approaches on the chosen dataset.
The rest of this paper is organized as follows. Related work is covered in Sect. 2, challenges in performing Arabic sentiment analysis are traced in Sect. 3, and data collection is covered in Sect. 4, followed by system implementation in Sect. 5. Section 6 covers evaluation and results, and Sect. 7 presents the conclusion.
2 Related Work
For sentiment analysis, the key requirement is the dataset. Recent efforts by Farra et al. [11] demonstrated the value of crowdsourcing as a highly successful technique for annotating datasets. Sentiment corpora for Arabic have been created to validate methods and carry out experiments [12–16].
A prerequisite for performing sentiment analysis is the availability of a corpus. A few significant attempts in the literature illustrate researchers' efforts to fill the gap in Arabic datasets for Arabic sentiment analysis. A freely available corpus was manually created by Al-Kabi et al. [17]; it covers five areas (Religion, Economy, Sport, Technology, and Food/Lifestyle), with each area covering 50 subjects.
Al-Kabi et al. [17] collected reviews from Yahoo Maktoob, with clear variation in the way the Arabic reviews were written: a mixture of Egyptian, English, MSA, and Levantine dialect. Unlike Al-Kabi et al. [17], who compiled their dataset manually, Farra et al. [18] used crowdsourcing to annotate their dataset. Apart from freely available corpora, there has been significant work on other corpora that, to our knowledge, have not been shared openly; these include financial [19, 20] and news [21–23] corpora.
Besides the corpus, the other main component in the whole process of Arabic sentiment analysis is the lexicon. Lexicons are curated word lists that play a key role in determining text polarity, and they may include adjectives, which are among the best indicators of polarity [24]. Notable lexicon-building efforts in the literature include business-oriented lexicons [25] and SentiStrength [26].
Some exemplary additions to Arabic sentiment analysis explore different methods in different domains, though each with limitations. Mountassir et al. [27] worked at the document level on 2,925 reviews in Modern Standard Arabic but performed no experiments to judge their proposed method. Working with news webpages and social reviews, Al-Kabi et al. [28] operated at the sentence level, dealing with colloquial Arabic but limited by a tiny dataset, while Elarnaoty et al. [21] worked at the document level with a focus only on news publications.
A Support Vector Machine classifier achieved 72.6% precision on a Twitter dataset of 1000 tweets [29]. Document-level sentiment analysis using a combined methodology, consisting of a lexicon and a machine learning approach with K-Nearest Neighbors and Maximum Entropy, on a mixed-domain corpus comprising education, politics, and sports achieved an F-measure of 80.29% [30]. Shoukry and Rafea [29] achieved a precision of 72.6% with the corpus-based method. A lexicon-based sentiment analysis tool achieved an accuracy of 70.05% on a Twitter dataset and 63.75% on a Yahoo Maktoob dataset [12].
S. Siddiqui et al.
A mixed approach combining lexical features and a Support Vector Machine classifier produced 84.01% accuracy [31]. A hybrid methodology involving lexical features, entropy, and K-Nearest Neighbors showed 79.90% accuracy [32]. Shoukry and Rafea [33] deployed two methodologies independently: a Support Vector Machine, which accomplished a precision of 78.80%, and a lexical approach, with an accuracy of 75.50%.
Among the challenges of sentiment analysis in Arabic, a principal one with immediate effect is subjectivity representation: the fine distinction between a subjective and an objective sentence raises new challenges. Several further challenges are found in sentiment analysis. If a negative opinion contains a word from the positive lexicon, it is often classified as positive when it is in fact negative. A word found as an entry in both the positive and the negative lexicons raises new challenges for the polarity task. For example, "ﺑﺎ ﻗﺮﻑ" (it makes me sick or disgusting) was found in both negative and positive reviews, ending up in both the positive and negative lexicons. Some people write rude comments, or a distinctive style of commenting using negated sentences in a positive review, which likewise raises new difficulties. For example, "ﺍﺣﺴﻦ ﻣﻦ ﻫﻴﻚ ﺭﺋﻴﺲ ﻭﺯﺭﺍﺀ ﻣﺎ ﻓﻲ" (There is no Prime Minister better than this one).
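To make the lexicon-and-rules idea concrete, here is a minimal sketch of one such heuristic. This is a hypothetical illustration with English placeholder tokens; the paper's actual rule chain, rule positions, and priorities are not shown in this excerpt.

```python
POSITIVE = {"better", "great"}   # hypothetical positive lexicon entries
NEGATIVE = {"disgusting"}        # hypothetical negative lexicon entries
NEGATORS = {"no", "not"}         # words that flip the polarity of what follows

def classify(tokens):
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        # Rule: a negator immediately before a sentiment word flips its polarity.
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return "positive" if score > 0 else "negative"

print(classify(["great"]))         # → positive
print(classify(["not", "great"]))  # → negative
```

Even this toy shows why rule ordering matters: the negation rule must fire before the polarity is summed, exactly the kind of positioning and prioritization the paper describes.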
As reported by Pang and Lee [34], positive words in one domain may hold the opposite polarity in another. For example, a word like "ﺑﺎﺭﺩ" (cold) was found in positive and in negative reviews, carrying a different polarity in each. Further challenges include the 140-character limit, which leads people to use short forms, spelling mistakes in the tweets, and in particular the use of slang [31]. The following are a few challenges hindering natural language processing in Arabic.
3.1 Encoding
Given Windows CP-1256 and Unicode support, one can read, write, and extract content written in Arabic. Issues still arise during processing in some tasks; hence the use of transliteration, where a mapping between Roman letters and Arabic letters is possible. Subsequently, keeping in mind the end