
2019 4th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE)

Sentiment Analysis in Airline Tweets Using Mutual Information for Feature Selection
Hastari Utama
Informatics Engineering
Universitas Amikom Yogyakarta
Yogyakarta, Indonesia
utama@amikom.ac.id

Abstract—Social media connects large numbers of users who want to communicate with one another. One of the most widely used social media platforms is Twitter, which contains opinions in the form of short messages called tweets. Companies need feedback from their customers to learn how their services are viewed, so sentiment analysis is needed to classify the sentiment expressed about a company. This research uses a dataset of tweets about US airlines. Because the dataset is provided on Kaggle and already includes several metadata attributes, experiments on feature selection can be carried out quickly. Feature selection in this research uses the Mutual Information method, chosen because previous references report that it is effective in measuring the correlation between attributes. The results show that training data built from features selected with mutual information performs well; however, compared with other feature selection methods such as Chi Square and ANOVA F, mutual information still needs verification of both processing time and accuracy.

Keywords—sentiment analysis, twitter, feature selection, mutual information.

I. INTRODUCTION

Social media has become a reference for companies that want to understand customer behavior today [1]. The results of such analysis can also guide policy decisions about a company's future business direction. The analysis classifies opinions or sentiments obtained from tweets, provided here in the form of datasets by Kaggle. The collection of texts from Twitter is therefore very valuable: it contains hidden information that should be revealed for the benefit of the company. Disclosing this information involves mining the data with various types of classifiers. The main task of this data mining process is converting unstructured text data into structured data, since structured text is a prerequisite for data mining. In addition, sentiment analysis involves a Natural Language Processing approach to extract meaning from the collection of words in a tweet.

One social media platform that is often used is Twitter. Twitter currently has about 330 million users worldwide, and the platform generates roughly 8,000 new items of data every second [2][3]. On Twitter, a tweet is a collection of texts or sentences that can contain news, opinions, arguments, and several other types of sentences [4]. Tweets therefore have the potential to store information useful to companies. Twitter users are not only the young but come from many circles, including government and business. Twitter can be used through various platforms such as mobile devices, websites, or desktop applications connected to the internet. This ease of access has resulted in a growing number of users over time.

An airline is an organization that provides flight services for passengers or goods. Airlines rent or own aircraft to provide these services and can form partnerships or alliances with other airlines for mutual benefit. An airline needs feedback from its customers to find out their views on its services. Such feedback would be difficult to obtain on its own through questionnaires or by sampling and interviewing customers.

Opinions from customers, and even from the general public, are usually found in social media comments. One social media platform containing these opinions is Twitter. However, Twitter holds a great many tweets, some of which are news and only some of which are opinions. Sentiment analysis is used to sort out the tweets that express opinions.

This study discusses an approach to sentiment analysis of an airline dataset. The dataset is analyzed to generate a sentiment classification. However, the text representation used for classification produces a high-dimensional feature space, so the dimensionality needs to be reduced, which can be done with feature selection. This research uses Mutual Information (MI) for feature selection; the method was chosen based on former references that report it to be effective in measuring the correlation between attributes. It is expected to increase accuracy and reduce processing time when training the sentiment analysis model. Finally, this research is expected to produce sentiment analysis software that can visualize trends among airline service users.

II. SENTIMENT ANALYSIS

Sentiment analysis, also called opinion mining, uses computational linguistics and data mining to convert text data into information. Its objective is to detect a person's mood, behavior, and opinions from existing text documents [5]; the result can be positive, negative, or neutral. However, the data to be processed is initially unstructured, and the main challenge in sentiment analysis is to transform unstructured data into structured data that can be processed further.

A sentiment involves a sentiment holder, an emotional disposition such as positive or negative polarity, and an object. This structure is shown in Figure 1. A sentiment or opinion is an expression of the view of someone who gives an opinion or comment, whether on social media or in other systems [6]. When the opinion owner encounters a situation where the object is



978-1-7281-5118-2/19/$31.00 ©2019 IEEE 295

involved, the opinion can be identified or traced through the impact of its interaction with the object or entity. These opinions have emotional tendencies that can be broken down into two types: positive and negative opinions. The task of sentiment analysis is to produce these two kinds of opinion from input opinions that do not yet have a label.

Fig. 1. Schematic Structure of Sentiment Analysis

III. FEATURE SELECTION

Feature selection methods can be divided into lexicon-based methods, which require markers from humans, and statistical methods, which provide markers automatically [7]. The statistical methods are the ones most often used in the context of sentiment analysis. The lexicon-based approach usually starts from a number of word collections and then enlarges the lexicon by detecting synonyms or consulting online resources. Feature selection techniques treat a document as a collection of words, called a bag of words (BOW); a document can also be thought of as a string that maintains the uniqueness of the words in it. BOW is often used because of the low complexity of the process.

The feature selection process is shown in Figure 2. Initially, a text document that has been labeled is converted into a vector with many features based on the collection of words [8]; this is the Bag of Words process mentioned earlier. The resulting features are relatively numerous, so it is necessary to discard features and keep only those with the highest values: a feature with a high value stores more important information than a feature with a lower value. The selected word features can then be passed on to the data training stage.

Fig. 2. Feature Selection Phases
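The phases in Figure 2, converting labeled documents into bag-of-words vectors, scoring each feature, and keeping only the highest-valued ones, can be sketched as follows. This is illustrative code, not the paper's implementation: the toy corpus is invented, and document frequency stands in for whichever scoring method is chosen.

```python
from collections import Counter

# Toy corpus standing in for preprocessed tweets (illustrative only).
docs = ["flight delayed again", "great flight crew", "delayed luggage again"]

# Build the vocabulary (the BOW feature set) and the count matrix.
vocab = sorted(set(w for d in docs for w in d.split()))
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

# Score each feature; document frequency is used here as the simplest
# stand-in for the selection methods discussed in this section.
def doc_freq(col):
    return sum(1 for row in bow if row[col] > 0)

scores = {w: doc_freq(i) for i, w in enumerate(vocab)}

# Keep the k highest-scoring features (score-then-threshold selection).
k = 3
selected = sorted(scores, key=scores.get, reverse=True)[:k]
```

Any of the scorers listed below (Chi Square, Information Gain, MI, and so on) could be swapped in for `doc_freq` without changing the surrounding select-top-k structure.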
The main idea of the various feature selection methods is the same: each algorithm computes a score for every feature, and the important features are then chosen according to an initial value or a threshold on the feature's score [8][9]. Several feature selection methods are often used, namely:

1. Document Frequency (DF): the DF method counts the number of documents in which a word or term appears. It is the simplest way to turn text representations into features.

2. Chi Square statistics: Chi Square is one of the most commonly used word selection algorithms and is based on statistical principles. It measures the divergence of the observed data distribution from the assumption that the appearance of a text feature is independent of the sentiment class.

3. Information Gain: also a method often used in text mining, where large amounts of text are treated as uncertain. Information Gain scores a text feature by the reduction in uncertainty obtained when the value of the feature becomes known, and it uses entropy to calculate text uncertainty.

4. Standard deviation: the standard deviation is a statistical method that is also used in feature selection. It measures how far points or individuals lie from their computed mean. Here a lower standard deviation is preferred, since it indicates that the values are dense around the resulting mean; however, a low standard deviation also indicates that a feature is close to other features.

5. Mutual Information: Mutual Information (MI) is an often-used filter-based method. It measures the importance of the information in a feature by scoring the feature against the class labels. The assumption is that a feature with a strong correlation to the class label will improve classification performance.

IV. MUTUAL INFORMATION

Mutual Information (MI) is a feature selection method and an effective way to measure the correlation between variables [10][11]. Suppose that x and y are discrete, paired variables with N observations. The relationship between x and y can then be expressed with the following equation:

I(x; y) = H(y) - H(y|x) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]    (1)

where H(y) is the entropy of y, a measure of the uncertainty of a discrete random variable [10]; H(y|x) is the conditional entropy of y given x; and p(·,·) is the joint probability function. MI indicates how much information x and y share. Since x and y can be dependent or independent, MI can be positive or equal to zero. Minimum Redundancy Maximum Relevance (mRMR), which uses MI directly for its redundancy value and for the relevance between variables, is one of the popular developments of the MI method. The relationship in Equation (1) is illustrated in Figure 3. The ranking criterion of the mRMR method can be expressed with the following equation:

J_mRMR(f_k) = max_{f_k ∈ F-S} [ I(f_k; C) - (1/|S|) Σ_{f_i ∈ S} I(f_k; f_i) ]    (2)

where both the relevance and the redundancy term are instances of the function I(·,·) defined by Equation (1) [8]. All available features are denoted by F, the already-selected features by S, a feature in S by f_i,


and the class or label in the data by C. The second term of Equation (2) expresses the redundancy between the candidate feature and the features already selected. Note that the candidate features are scored as pairs of variables, which does not take into account joint relationships or conditional redundancies determined by a third or further variables.

Fig. 3. Correlation MI Function
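Equations (1) and (2) can be illustrated in code. The sketch below is not the paper's implementation: `mi` estimates I(x; y) in nats from paired discrete samples, and `mrmr` applies the greedy selection of Eq. (2); the toy feature matrix is invented.

```python
from collections import Counter
from math import log

def mi(xs, ys):
    """Estimate I(x; y) in nats from paired discrete samples (Eq. 1)."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr(features, labels, k):
    """Greedy mRMR selection (Eq. 2): pick the feature maximizing
    relevance I(f; C) minus mean redundancy with the selected set S."""
    remaining = list(range(len(features)))
    selected = []
    while remaining and len(selected) < k:
        def score(j):
            relevance = mi(features[j], labels)
            if not selected:
                return relevance
            redundancy = sum(mi(features[j], features[i]) for i in selected)
            return relevance - redundancy / len(selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: feature 0 matches the labels exactly, feature 2 is independent.
labels = [0, 0, 1, 1]
features = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
picked = mrmr(features, labels, k=2)
```

Note that `mi(f, f)` equals the entropy of `f`, so exact duplicates of an already-selected feature are penalized heavily by the redundancy term from the second selection onward.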


In research on the classification of news topics, MI has also been shown to influence the performance of stemming in the text preprocessing phase [12]. That work shows that providing more features to train the data can decrease the level of accuracy, while fewer features can increase it. If a feature has a high MI value, the feature carries high information value and is very necessary for training the data; a feature with a low MI value is less necessary for the training process. Features with small MI values can be considered unnecessary in the training data, and if such features are nevertheless forced into the classification model, they reduce its accuracy.

V. PROPOSED METHOD

This research is an experimental study. It begins with a literature study on sentiment analysis. The next phase is data collection: the data is downloaded from the Kaggle website as a dataset with several attributes, or metadata. Only the airline_sentiment and text attributes are taken from the dataset. The data is analyzed with a descriptive approach, and the results of this analysis are used as input for the data training process. The next phase is the design of the sentiment analysis application, and the design is then used as a reference in the construction phase, which builds the sentiment analysis application. The resulting application is used to evaluate the Mutual Information method.

Fig. 4. Research Phases

Figure 4 shows the research flow. The research starts with a literature study intended to gather references in the form of papers published in journals and proceedings. Next, data is collected through both documentation and observation, then analyzed and selected as preprocessing material. A sentiment analysis system is then designed for the experiments, and the available data is prepared as a dataset used in the trials. The experiments are measured with the following parameters: processing time, number of features, amount of data, and accuracy. The experimental results are then discussed and compiled into a paper to be published in accordance with the predetermined outcomes.

Figure 5 shows the flow of the sentiment analysis process for obtaining the sentiment of tweets. It contains several important steps:

1. Tweet Crawling
Tweet data is retrieved using the Twitter API and collected periodically to build a dataset, which is then chosen as the corpus.

2. Tokenizing
This stage breaks a sentence or tweet into individual words; in addition, punctuation and URL addresses are removed. The result of this process is a collection of words, also called terms.

3. Slang removal
This stage removes slang words. A slang word is a non-standard new word arising from the development of communication in society, and it gives rise to inappropriate meanings when processing text. This phase applies a synonym method to replace slang with standard words: words in the dataset are matched against a slang word list and its synonyms, and when there is a match the word is replaced by its standard synonym.

4. Stop word removal
The terms produced are unique and cause no redundancy, but not all of them are used: some words carry no meaning. Such words belong to the stop word group and are usually conjunctions. They are removed to reduce the number of words used in the data training process, so that the process does not take too long.

5. Stemming
This stage changes affixed words into their base words, again to reduce the number of words, since base and affixed words have the same meaning. Stemming is carried out using the Porter Stemming method.

6. Feature Extraction
Before data training, a feature extraction process is needed. A text collection is unstructured data, so it must be broken down into several features or attributes so that sentiment classification can be performed.

7. Feature Selection


Feature selection is needed to choose the features with high value. Not all of the features that are formed need to be used for classification, because many features require a long processing time. This study uses feature selection with the Mutual Information method: features scored highly by the method are retained, whereas features with lower values are not used.
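Steps 2 to 5 of the pipeline above can be sketched as follows. This is illustrative code rather than the paper's implementation: the slang list and stop word list are tiny samples, and the suffix stripper is a simplified stand-in for the actual Porter stemmer.

```python
import re

# Tiny illustrative dictionaries; a real system would use full lists.
SLANG = {"u": "you", "gr8": "great", "thx": "thanks"}
STOP_WORDS = {"a", "an", "and", "the", "to", "is"}

def tokenize(tweet):
    """Step 2: strip URLs and punctuation, split into terms."""
    tweet = re.sub(r"https?://\S+", " ", tweet.lower())
    return re.findall(r"[a-z0-9']+", tweet)

def remove_slang(terms):
    """Step 3: replace slang words with their standard synonyms."""
    return [SLANG.get(t, t) for t in terms]

def remove_stop_words(terms):
    """Step 4: drop words that carry no meaning."""
    return [t for t in terms if t not in STOP_WORDS]

def stem(term):
    """Step 5: simplified suffix stripping (stand-in for Porter stemming)."""
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(tweet):
    return [stem(t) for t in remove_stop_words(remove_slang(tokenize(tweet)))]
```

Each stage maps a list of terms to a shorter or cleaner list, so the stages compose in the order shown in Figure 5.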

Fig. 5. Design of Sentiment Analysis

TABLE II. MI SCORE FOR 10 FEATURES SELECTED

Feature   Label      MI Score
1         neutral    0.20576
2         positive   0.16544
3         neutral    0.08375
4         negative   0.08375
5         negative   0.07264
6         negative   0.05764
7         positive   0.04629
8         positive   0.04622
9         positive   0.04404
10        positive   0.0435
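The selection that produces a table like Table II, scoring every feature with MI and keeping the k best, mirrors what the SelectKBest algorithm described in the MI score generation step does. The following pure-Python sketch is illustrative only; the toy matrix stands in for the 14641 x 10029 bag of words.

```python
from collections import Counter
from math import log

def mutual_info(column, labels):
    """I(feature; label) in nats from paired discrete samples."""
    n = len(column)
    joint = Counter(zip(column, labels))
    px, py = Counter(column), Counter(labels)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def select_k_best(X, y, k):
    """Score every feature with MI and keep the k best, as SelectKBest does."""
    n_features = len(X[0])
    scores = [mutual_info([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores

# Toy bag-of-words matrix: 4 documents x 3 binary features.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = ["negative", "negative", "positive", "positive"]
kept, scores = select_k_best(X, y, k=2)
```

With k fixed in advance, the same call scales to any number of features: only the indices of the k best-scored columns are kept for training.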

VI. RESULT AND DISCUSSION

A. Dataset Structure
This research involves datasets of US airline tweets. The data consists of two files, whose structure is shown in Table I. Both files come from the same source; however, the preprocessing performed here uses only the tweet attribute. After preprocessing, which consists of tokenization, stop word removal, and stemming, a list of words is generated. This word list is then formed into a bag of words, so that the size of the data (rows x attributes) becomes:

Tweets.csv = 14641 x 10029
Train.csv = 3340 x 4403

TABLE I. DATASET STRUCTURE

No.  Filename    Source                                                        Year  Size    Dimension (row x column)
1.   Tweets.csv  https://www.kaggle.com/crowdflower/twitter-airline-sentiment  2015  1.13MB  14641 x 15
2.   train.csv   https://www.kaggle.com/tango911/airline-sentiment-tweets      2015  227KB   3340 x 12

B. MI Score Generation
In this research, the MI value of each attribute was calculated, although Table II shows the calculation results for the 10 selected features only. The selection of these 10 features also uses the mutual information method. The selected features are taken from the dataset with dimensions 14641 x 10029, so the Bag of Words process formed 10029 features. Because the example in Table II uses 10 features in total, they are the 10 features with the largest MI values among the original 10029. The MI-scored features are recorded with the SelectKBest algorithm, which keeps the k features, where k is set at the start, that have the best MI values among all existing features.

C. Comparison of Accuracy Evaluation
This subsection discusses the accuracy comparison, whose results are shown in Tables III and IV. Table III reports accuracy tests using five different classifiers, namely Naive Bayes, SVM Linear, SVM RBF, Logistic Regression, and Decision Tree. The tests were conducted on the 3340 x 4403 and the 14641 x 10029 dataset, without MI feature selection. The results show that the best average accuracy is obtained on the largest dataset, 14641 x 10029; however, the average processing time for data training is shorter on the 3340 x 4403 dataset. The best classifier for both datasets is the Linear SVM, while the fastest data training comes from the Naive Bayes method. From this test without feature selection it can be stated that the larger the dataset, the longer the training process, and that increasing the amount of data also increases the accuracy of every method.

TABLE III. ACCURACY TEST FOR DATASET 3340 X 4403

No.  Classifier           Accuracy (%)  Error (%)  Standard Deviation  Time Process (ms)
1    Naive Bayes          65.86         34.14      0.00882             0.00653
2    SVM Linear           72.66         27.34      0.01465             6.39225
3    SVM RBF              70.5          29.5       0.02667             10.10942
4    Logistic Regression  71.64         28.36      0.03314             4.04008
5    Decision Tree        63.22         36.78      0.01098             2.91547
     Averages             68.776        31.224     0.01885             4.69275


TABLE IV. ACCURACY TEST FOR DATASET 14641 X 10029

No.  Classifier           Accuracy (%)  Error (%)  Standard Deviation  Time Process (ms)
1    Naive Bayes          67.985        32.015     0.01304             0.0251
2    SVM Linear           76.913        23.087     0.02026             103.479
3    SVM RBF              76.161        23.839     0.02768             146.107
4    Logistic Regression  76.776        23.224     0.02704             18.5619
5    Decision Tree        69.037        30.963     0.02942             16.3178
     Averages             73.3744       26.6256    0.02349             56.8982
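The evaluation behind Tables III and IV pairs each classifier's accuracy with its training time. As a minimal illustration of that procedure (not the paper's code), the sketch below trains a small multinomial Naive Bayes, the fastest classifier in the tables, on an invented bag-of-words matrix and measures accuracy and training time:

```python
import time
from math import log
from collections import defaultdict

class MultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace smoothing."""
    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_feat = len(X[0])
        counts = {c: [1] * n_feat for c in self.classes}  # add-one smoothing
        class_counts = defaultdict(int)
        for row, label in zip(X, y):
            class_counts[label] += 1
            for j, v in enumerate(row):
                counts[label][j] += v
        self.log_prior = {c: log(class_counts[c] / len(y)) for c in self.classes}
        self.log_like = {c: [log(v / sum(counts[c])) for v in counts[c]]
                         for c in self.classes}
        return self

    def predict(self, X):
        def score(row, c):
            return self.log_prior[c] + sum(v * w for v, w in zip(row, self.log_like[c]))
        return [max(self.classes, key=lambda c: score(row, c)) for row in X]

# Toy bag-of-words data standing in for the airline tweets.
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = ["negative", "negative", "positive", "positive"]

start = time.perf_counter()
model = MultinomialNB().fit(X, y)
train_ms = (time.perf_counter() - start) * 1000  # training time, as in the tables

preds = model.predict(X)
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y) * 100
```

Repeating the same fit-and-time loop over several classifiers and both datasets yields exactly the kind of accuracy/time comparison reported in the tables.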
D. Time Process for Training Data Using Selection Methods

Previous tests showed that the fastest classifier was Naive Bayes, so this classifier was chosen as the sample for the feature selection tests; because its data training is faster, it is also expected to produce test results faster. In this section, feature selection is performed with the MI method and compared with the Chi Square and ANOVA F statistical methods. The dataset of size 14641 x 10029 is used as the sample for this test, which starts with 1000 selected features and increases the number up to 10029. The results are shown in Table V.

The results of the feature selection tests with the MI, Chi Square, and ANOVA F methods are represented graphically in Figures 6 and 7. Figure 6 shows the accuracy for each number of selected features, starting from 1000. As the number of selected features approaches the full 10029, accuracy tends to decrease; this happens not only for MI but also for Chi Square and ANOVA F, because not all features contain important information, and features carrying less important information can reduce accuracy. The MI method behaves somewhat differently: the initial selection of 1000 features does not give its highest accuracy, which instead increases at 2000 features. Further analysis is therefore needed to determine the number of features that yields the best accuracy. Table V shows that the average accuracy of the MI method is 69.763%, which is lower than that of the Chi Square and ANOVA F methods. In terms of processing time, MI needs an average of 0.02298 seconds to train the data, which is faster than Chi Square but longer than ANOVA F.

Fig. 6. Comparison of the Feature Selection Method Accuracy

Fig. 7. Comparison of the Feature Selection Method Time Processing

VII. FUTURE WORK

The MI method scores every feature, and features with high values are the ones that carry important information; this affects the accuracy of the sentiment analysis. However, the highest mutual information values are spread across the existing features, so particular methods are required to select the MI-scored features more effectively. This study still checks the MI values incrementally from beginning to end; more effective methods are therefore needed to obtain the features with high MI values.

VIII. CONCLUSION

Several things can be concluded from this study. The Naive Bayes classifier has the fastest data training on the airline dataset. The SVM Linear classifier has the best accuracy in classifying the tweets in this dataset. A feature whose MI value is greater than that of other features contains more important information. Finally, using fewer features tends to make the data training process faster.

ACKNOWLEDGEMENT

I thank all the members of Universitas Amikom Yogyakarta who have supported the implementation of this research. I also express my gratitude to all parties whom I cannot mention one by one.

REFERENCES
[1] S. Y. Yoo, J. I. Song, and O. R. Jeong, "Social media contents based sentiment analysis and prediction system," Expert Syst. Appl., vol. 105, pp. 102-111, 2018.
[2] A. D. Laksito et al., "A Comparison Study of Search Strategy on Collecting Twitter Data for Drug Adverse Reaction," 2018 Int. Semin. Appl. Technol. Inf. Commun., pp. 356-360, 2018.
[3] M. Daniel, R. F. Neves, and N. Horta, "Company event popularity for financial markets using Twitter and sentiment analysis," Expert Syst. Appl., vol. 71, 2017.
[4] I. Chaturvedi, E. Cambria, R. E. Welsch, and F. Herrera, "Distinguishing between facts and opinions for sentiment analysis: Survey and challenges," Inf. Fusion, vol. 44, pp. 65-77, 2018.
[5] R. Wagh, "Survey on Sentiment Analysis using Twitter Dataset," 2018 Second Int. Conf. Electron. Commun. Aerosp. Technol. (ICECA), pp. 208-211, 2018.
[6] M. Soleymani, D. Garcia, B. Jou, B. Schuller, S. F. Chang, and M. Pantic, "A survey of multimodal sentiment analysis," Image Vis. Comput., vol. 65, pp. 3-14, 2017.


[7] W. Medhat, A. Hassan, and H. Korashy, "Sentiment analysis algorithms and applications: A survey," Ain Shams Eng. J., 2014.
[8] Y. Liu, J. W. Bi, and Z. P. Fan, "Multi-class sentiment classification: The experimental comparisons of feature selection and machine learning algorithms," Expert Syst. Appl., vol. 80, pp. 323-339, 2017.
[9] A. Yousefpour, R. Ibrahim, and H. N. A. Hamed, "Ordinal-based and frequency-based integration of feature selection methods for sentiment analysis," Expert Syst. Appl., 2017.
[10] Y. Wang, S. Cang, and H. Yu, "Mutual Information Inspired Feature Selection Using Kernel Canonical Correlation Analysis," Expert Syst. with Appl. X, p. 100014, 2019.
[11] M. J. Tian, R. Y. Cui, and Z. H. Huang, "Automatic Extraction Method for Specific Domain Terms Based on Structural Features and Mutual Information," Proc. 2018 5th Int. Conf. Inf. Sci. Control Eng. (ICISCE 2018), pp. 147-150, 2019.
[12] F. S. Nurfikri, M. S. Mubarok, and Adiwijaya, "News topic classification using mutual information and Bayesian network," 2018 6th Int. Conf. Inf. Commun. Technol. (ICoICT 2018), pp. 162-166, 2018.

TABLE V. TEST RESULT FOR SELECTION FEATURES METHOD

          Mutual Information       Chi Square               ANOVA F
Features  Accuracy (%)  Time (s)   Accuracy (%)  Time (s)   Accuracy (%)  Time (s)
1000      71.093        0.02072    71.919        0.03124    72.227        0.02064
2000      71.584        0.02089    71.605        0.02549    71.66         0.02242
3000      71.564        0.02356    70.895        0.03234    71.284        0.02155
4000      70.526        0.02089    70.567        0.02486    70.485        0.02161
5000      70.328        0.0258     70.376        0.02491    70.403        0.02181
6000      69.795        0.02236    69.938        0.02317    69.945        0.02431
7000      69.406        0.02525    69.474        0.0263     69.522        0.02265
8000      68.845        0.0214     68.975        0.02575    68.914        0.02233
9000      68.285        0.023      68.483        0.0305     68.395        0.0227
10000     67.978        0.02383    68.005        0.04826    68.005        0.02173
10029     67.985        0.0251     67.985        0.0251     67.985        0.0251
Averages  69.763        0.02298    69.838        0.0289     69.893        0.02244
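The Chi Square scorer that Table V compares against MI can be sketched in the same shape as the MI scorer used earlier. This is illustrative code, not the paper's implementation; the toy matrix is invented.

```python
from collections import Counter

def chi2_score(column, labels):
    """Pearson chi-square statistic between a discrete feature and the labels."""
    n = len(column)
    obs = Counter(zip(column, labels))
    fx, fy = Counter(column), Counter(labels)
    stat = 0.0
    for x in fx:
        for y_val in fy:
            expected = fx[x] * fy[y_val] / n  # counts expected under independence
            stat += (obs[(x, y_val)] - expected) ** 2 / expected
    return stat

# Toy bag-of-words matrix: 4 documents x 3 binary features.
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = ["negative", "negative", "positive", "positive"]
scores = [chi2_score([row[j] for row in X], y) for j in range(3)]
```

Swapping this scorer (or an ANOVA F scorer) into the same top-k selection loop reproduces the kind of sweep over feature counts reported in Table V.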

