Professional Documents
Culture Documents
Previously used techniques for sentiment this new method is that while it can efficiently
classification can be classified into three categories. classify items, it is very computationally expensive to
These include machine learning algorithms, link conduct the initial feature selection, since both GA
analysis methods, and score based approaches. The and IG are expensive to run. Genetic algorithms are
effectiveness of machine learning techniques when search heuristics that are similar to the process of
applied to sentiment classification tasks is evaluated biological evolution and natural selection and
in the pioneering research by Pang et al, 2002. Many survival of the fittest. Genetic Algorithms (GAs) are
studies have used machine learning algorithms with probabilistic search methods. GAs are applied for
support vector machines (SVM) and Naïve Bayes natural selection and natural genetics in artificial
(NB) being the most commonly used. SVM has been intelligence to find the globally optimal solution from
used extensively for movie reviews (Pang et al, 2002; the set of feasible solutions (S Chandrakala et al,
Pang and Lee, 2004; Whitelaw et al., 2005) while 2012). The experiments with GA’s start with a large
Naïve Bayes has been applied to reviews and web set of possible extractable syntactic, semantic and
discourse (Pang et al, 2002; Pang and Lee, 2004; discourse level feature set. The fitness function
Efron, 2004). In comparisons, SVM has calculates the accuracy of the subjectivity classifier
outperformed other classifiers such as NB (Pang et based on the feature set identified by natural selection
al., 2002). Hesham Arafat et al., (2014) results show through the process of crossover and mutation after
that mRMR is better compared to IG for sentiment each generation. The ensemble technique, which
classification, Hybrid feature selection method based combines the outputs of several base classification
on the RST and Information Gain (IG) is better models to form an integrated output, has become an
compared to the previous methods. Proposed effective classification method for many domains (T.
methods are evaluated on four standard datasets viz. Ho, 1994; J. Kittler,, 1998). In topical text
Movie review, Product (book, DVD, and electronics) classification, several researchers have achieved
reviewed datasets, and Experimental results show improvements in classification accuracy via the
that hybrid feature selection method outperforms than ensemble technique. In the early work (L. Larkey et
feature selection methods for sentimental al, 1996), a combination of different classification
classification. Sumathi T et al., (2013) have algorithms (k-NN, Relevance feedback and Bayesian
compared three methods RIDOR, Naïve Bayes and classifier) produces better results than any single type
FURIA. Further it can extent to improve the of classifier. Freund and Schapire (1995,1996)
performance of the system using feature reduction proposed an algorithm the basis of which is to
technique. Also 400 reviews were randomly selected adaptively resample and combine (hence the
from IMDb dataset and feature extracted using stop acronym--arcing) so that the weights in the
word, stemming and IDF. The performance of resampling are increased for those cases most often
FURIA classifier is better than Naïve Bayes by 8.21 misclassified and the combining is done by weighted
% and by 21.71 compared to RIDOR. Jotheeswaran voting. In this research work, proposes a new hybrid
et al., (2012) proposed feature set extraction from method for sentiment mining problem. A new
movie reviews. Inverse document frequency is architecture based on coupling classification methods
computed and feature set reduced using Principal (NB and GA) using arcing classifier adapted to
Component Analysis. Pre processing’s effectiveness sentiment mining problem is defined in order to get
is evaluated using Naive Bayes and Linear Vector better results.
Quantization. Kabinsinghaetal., (2012) investigated
movie ratings. Data mining was applied to movie 3. Proposed Methodology
classification. Movies are rated into PG, PG-13 and R
in the prototype. The 240 prototype movies from Several researchers have investigated the
IMDb (http://imdb.com) were used. The other work combination of different classifiers to from an
that used sophisticated feature selection was by ensemble classifier (D. Tax et al, 2000). An
Abbasi et al. (2008). They found that using either important advantage for combining redundant and
information gain (IG) or genetic algorithms (GA) complementary classifiers is to increase robustness,
resulted in an improvement in accuracy. They also accuracy, and better overall generalization. This
combined the two in a new algorithm called the research work aims to make an intensive study of the
Entropy Weighted Genetic Algorithm (EWGA), effectiveness of ensemble techniques for sentiment
which achieved the highest level of accuracy in classification tasks. In this work, first the base
sentiment analysis to date of 91.7%. The drawback of classifiers such as Naïve Bayes (NB), Genetic
140
International Journal of Advanced Computer Research (ISSN (print): 2249-7277 ISSN (online): 2277-7970)
Volume-3 Number-4 Issue-13 December-2013
Algorithm (GA) are constructed to predict example, using the Porter‟s stemmer, the English
classification scores. The reason for that choice is word “generalizations” would subsequently be
that they are representative classification methods stemmed as “generalizations → generalization →
and very heterogeneous techniques in terms of their generalize → general → gener”. In cases where the
philosophies and strengths. All classification source documents are web pages, additional pre-
experiments were conducted using 10 × 10-fold processing is required to remove / modify HTML and
cross-validation for evaluating accuracy. Secondly, other script tags.
well known heterogeneous ensemble techniques are
performed with base classifiers to obtain a very good Feature extraction / selection helps identify important
generalization performance. The feasibility and the words in a text document. This is done using methods
benefits of the proposed approaches are demonstrated like TF-IDF (term frequency-inverse document
by means of movie review that is widely used in the frequency), LSI (latent semantic indexing), multi-
field of sentiment classification. A wide range of word etc. In the context of text classification, features
comparative experiments are conducted and finally, or attributes usually mean significant words, multi-
some in-depth discussion is presented and words or frequently occurring phrases indicative of
conclusions are drawn about the effectiveness of the text category.
ensemble technique for sentiment classification. This
research work proposes new hybrid method for After feature selection, the text document is
sentiment mining problem. A new architecture based represented as a document vector, and an appropriate
on coupling classification methods using arcing machine learning algorithm is used to train the text
classifier adapted to sentiment mining problem is classifier. The trained classifier is tested using a test
defined in order to get better results. The main set of text documents. If the classification accuracy of
originality of the proposed approach is based on five the trained classifier is found to be acceptable for the
main parts: Pre-processing phase, Document test set, then this model is used to classify new
Indexing phase, feature reduction phase, instances of text documents.
classification phase and combining phase to
aggregate the best classification results. B. Document Indexing
Creating a feature vector or other representation of a
A. Data Pre-processing document is a process that is known in the IR
Different pre-processing techniques were applied to community as indexing. There are a variety of ways
remove the noise from out data set. It helped to to represent textual data in feature vector form;
reduce the dimension of our data set, and hence however most are based on word co-occurrence
building more accurate classifier, in less time. patterns. In these approaches, a vocabulary of words
is defined for the representations, which are all
The main steps involved are i) document pre- possible words that might be important to
processing, ii) feature extraction / selection, iii) classification. This is usually done by extracting all
model selection, iv) training and testing the classifier. words occurring above a certain number of times
(perhaps 3 times), and defining your feature space so
Data pre-processing reduces the size of the input text that each dimension corresponds to one of these
documents significantly. It involves activities like words. When representing a given textual instance
sentence boundary determination, natural language (perhaps a document or a sentence), the value of each
specific stop-word elimination and stemming. Stop- dimension (also known as an attribute) is assigned
words are functional words which occur frequently in based on whether the word corresponding to that
the language of the text (for example, „a‟, ‟the‟, ‟an‟, dimension occurs in the given textual instance. If the
‟of‟ etc. in English language), so that they are not document consists of only one word, then only that
useful for classification. Stemming is the action of corresponding dimension will have a value, and
reducing words to their root or base form. For every other dimension (i.e., every other attribute) will
be zero. This is known as the ``bag of words''
English language, the Porter‟s stemmer is a popular approach. One important question is what values to
algorithm, which is a suffix stripping sequence of use when the word is present. Perhaps the most
systematic steps for stemming an English word, common approach is to weight each present word
reducing the vocabulary of the training text by using its frequency in the document and perhaps its
approximately one-third of its original size. For frequency in the training corpus as a whole. The most
141
International Journal of Advanced Computer Research (ISSN (print): 2249-7277 ISSN (online): 2277-7970)
Volume-3 Number-4 Issue-13 December-2013
common weighting function is the tfidf (term Here the model fails to account for multiple
frequency-inverse document frequency) measure, but occurrences of words within the same document,
other approaches exist. In most sentiment which is a more simple model. However, if multiple
classification work, a binary weighting function is word occurrences are meaningful, then a multinomial
used. Assigning 1 if the word is present, 0 otherwise model should be used instead, where a multinomial
has been shown to be most effective. distribution accounts for multiple word occurrences.
C. Dimensionality Reduction Here, the words become the events.
Dimension Reduction techniques are proposed as a
data pre-processing step. This process identifies a 2) Genetic Algorithm (GA)
suitable low-dimensional representation of original The genetic algorithm is a model of machine learning
data. Reducing the dimensionality improves the which derives its behaviour from a metaphor of some
computational efficiency and accuracy of the data of the mechanisms of evolution in nature. This done
analysis. by the creation within a machine of a population of
Steps: individuals represented by chromosomes, in essence
Select the dataset. a set of character strings. The individuals represent
Perform discretization for pre-processing candidate solutions to the optimization problem being
the data. solved. In genetic algorithms, the individuals are
Apply Best First Search algorithm to filter typically represented by n-bit binary vectors. The
out redundant & super flows attributes. resulting search space corresponds to an n–
Using the redundant attributes apply dimensional boolean space. It is assumed that the
classification algorithm and compare their quality of each candidate solution can be evaluated
performance. using a fitness function. Genetic algorithms use some
Identify the Best One. form of fitness-dependent probabilistic selection of
individuals from the current population to produce
1) Best first Search individuals for the next generation. The selected
Best First Search (BFS) uses classifier evaluation individuals are submitted to the action of genetic
model to estimate the merits of attributes. The operators to obtain new individuals that constitute the
attributes with high merit value is considered as next generation. Mutation and crossover are two of
potential attributes and used for classification the most commonly used operators that are used with
Searches the space of attribute subsets by augmenting genetic algorithms that represent individuals as
with a backtracking facility. Best first may start with binary strings. Mutation operates on a single string
the empty set of attributes and search forward, or and generally changes a bit at random while
start with the full set of attributes and search crossover operates on two parent strings to produce
backward, or start at any point and search in both two offsprings. Other genetic representations require
directions. the use of appropriate genetic operators.
6. Conclusions
B. Results and Discussion
In this research, a new hybrid technique is
Table 1: The performance of base and hybrid investigated and evaluated their performance based
classifier for movie review data on the movie review data and then classifying the
reduced data by NB and GA. Next a hybrid NB-GA
Dataset Classifiers Accuracy model and NB, GA models as base classifiers are
Movie- Naïve Bayes (NB) 91.15 % designed. Finally, a hybrid system is proposed to
Review Genetic Algorithm 91.25 % make optimum use of the best performances
Data (GA) delivered by the individual base classifiers and the
Proposed Hybrid 93.80 % hybrid approach. The hybrid NB-GA shows higher
NB-GA percentage of classification accuracy than the base
Method classifiers and enhances the testing time due to data
dimensions reduction. The experiment results lead to
Accuracy for Classification Methods in Movie Review Data the following observations.
144
International Journal of Advanced Computer Research (ISSN (print): 2249-7277 ISSN (online): 2277-7970)
Volume-3 Number-4 Issue-13 December-2013
Second European Conference on Computational [16] D. Tax, M. Breukelen, R. Duin, and J. Kittler,
Learning Theory, pp. 23-37. (2000), Combining multiple classifiers by
[6] Freund, Y. and Schapire, R. (1996), averaging or by mutliplying?, Pattern
“Experiments with a new boosting algorithm", In Recognition, Vol 33, pp. 1475-1485.
Proceedings of the Thirteenth International [17] Whitelaw, C., Garg, N., and Argamon, S. (2005),
Conference on Machine Learning, Bari, Italy, “Using appraisal groups for sentiment analysis”,
pp.148-156. In Proceedings of the 14th ACM Conference on
[7] Hesham Arafat,Rasheed M.Elawady,Sherif Information and Knowledge Management, pp.
Barakat,Nora M.Elrashidy., (2014), "Different 625-631.
Feature Selection for Sentiment Classification",
International Journal of Information Science and
Intelligent System, vol 3issue 1, pp. 137-150.
[8] T. Ho, J. Hull, S. Srihari, (1994), “Decision
combination in multiple classifier systems”, IEEE
Transactions on Pattern Analysis and Machine
Intelligence, 16, pp. 66–75. Dr. M.Govindarajan received the B.E
[9] Jotheeswaran, J., Loganathan, R., and and M.E and Ph.D Degree in Computer
MadhuSudhanan, B, (2012), “Feature Reduction Science and Engineering from
using Principal Component Analysis for Opinion Annamalai University, Tamil Nadu,
Mining”, International Journal of Computer India in 2001 and 2005 and 2010
Science and Telecommunications ,Vol.3, respectively. He did his post-doctoral
No.5,pp.118-121. research in the Department of
[10] S.Kabinsingha,S., Chindasorn,C., and Computing, Faculty of Engineering and
Chantrapornchai.A, (2012), “Movie Rating Physical Sciences, University of Surrey, Guildford, Surrey,
Approach and Application Based on Data Author’s
United Ph oto in 2011 and pursuing Doctor of Science at
Kingdom
Mining”, International Journal of Engineering Utkal University, orissa, India. He is currently an Assistant
and Innovative Technology ,Vol. 2, No.1, pp.77- Professor at the Department of Computer Science and
83. Engineering, Annamalai University, Tamil Nadu, India. He
[11] J. Kittler, (1998), “Combining classifiers: a has presented and published more than 75 papers at
theoretical framework”, Pattern Analysis and Conferences and Journals and also received best paper
Applications, 1, pp.18–27. awards. He has delivered invited talks at various national
[12] L. Larkey, W. Croft, (1996), “Combining and international conferences. His current Research
classifiers in text categorization”, in: Proceeding Interests include Data Mining and its applications, Web
of ACM SIGIR Conference, ACM, New York, Mining, Text Mining, and Sentiment Mining. He was the
NY, USA, pp. 289–297. recipient of the Achievement Award for the field and to the
[13] B. Pang, L. Lee, S. Vaithyanathan, (2002), Conference Bio-Engineering, Computer science,
“Thumbs up? Sentiment classification using Knowledge Mining (2006), Prague, Czech Republic,
machine learning techniques”, in: Proceedings of Career Award for Young Teachers (2006), All India
the Conference on Empirical Methods in Natural Council for Technical Education, New Delhi, India and
Language Processing (EMNLP), pp. 79–86. Young Scientist International Travel Award (2012),
[14] Pang, B., and Lee, L. (2004), “A sentimental Department of Science and Technology, Government
education: Sentimental analysis using subjectivity of India New Delhi. He is Young Scientists awardee
summarization based on minimum cuts”, In under Fast Track Scheme (2013), Department of Science
Proceedings of the 42nd Annual Meeting of the and Technology, Government of India, New Delhi and
Association for Computational Linguistics, pp. also granted Young Scientist Fellowship (2013), Tamil
271-278. Nadu State Council for Science and Technology,
[15] Sumathi T, Karthik S, Marikannan M, (2013), Government of Tamil Nadu, Chennai. He has visited
"Performance Analysis of Classification Methods countries like Czech Republic, Austria, Thailand, United
for Opinion Mining", International Journal of Kingdom, Malaysia, U.S.A, and Singapore. He is an active
Innovations in Engineering and Technology Member of various professional bodies and Editorial Board
(IJIET) Vol. 2 Issue 4, pp 171-177. Member of various conferences and journals.
145